Posted to general@lucene.apache.org by lionel duboeuf <li...@boozter.com> on 2010/02/05 10:27:28 UTC

Document Frequency for a set of documents

Hi,

Sorry for asking again: I still have not found a scalable solution to 
get the document frequency of a term t according to a set of documents. 
Lucene only stores the document frequency for the global corpus, but I 
would like to be able to get the document frequency of a term according 
only to a subset of documents (i.e. a user's collection of documents).

I guess that querying the index to get the number of hits for each term 
and for each field, filtered by a user, will be too slow.
Any ideas?
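
(To illustrate the limitation: in the Lucene 2.x API, IndexReader.docFreq()
only ever answers for the whole index; the field and term below are just
placeholders.)

    // corpus-wide document frequency -- there is no per-subset variant
    int globalDf = indexReader.docFreq(new Term("body", "lucene"));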


regards,

Lionel




Re: Document Frequency for a set of documents

Posted by lionel duboeuf <li...@boozter.com>.
Uwe Schindler wrote:
> How about having more than one index, i.e. one for each user? If you want to search on all of them, use a MultiReader over all the separate indexes; if you only want to search on a subset, use the corresponding index's IndexReader instead.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>   
Well, I expect to have more than 1 million users, so I think this will 
not be a scalable solution.
Thanks

lionel



RE: Document Frequency for a set of documents

Posted by Uwe Schindler <uw...@thetaphi.de>.
How about having more than one index, i.e. one for each user? If you want to search on all of them, use a MultiReader over all the separate indexes; if you only want to search on a subset, use the corresponding index's IndexReader instead.
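
A minimal sketch of that idea (Lucene 2.x/3.0 API; the directory layout
and the field/term names are only illustrative assumptions):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    // one physical index per user, e.g. /indexes/<userId>/
    IndexReader userA = IndexReader.open(FSDirectory.open(new File("/indexes/userA")));
    IndexReader userB = IndexReader.open(FSDirectory.open(new File("/indexes/userB")));

    // per-user document frequency comes straight from that user's own index
    int dfForUserA = userA.docFreq(new Term("body", "lucene"));

    // searching across all users: wrap the per-user readers in a MultiReader
    MultiReader all = new MultiReader(new IndexReader[] { userA, userB });
    IndexSearcher searcher = new IndexSearcher(all);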

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: lionel duboeuf [mailto:lionel.duboeuf@boozter.com]
> Sent: Friday, February 05, 2010 10:27 AM
> To: general@lucene.apache.org
> Subject: Document Frequency for a set of documents
> 
> Hi,
> 
> Sorry for asking again: I still have not found a scalable solution to
> get the document frequency of a term t according to a set of documents.
> Lucene only stores the document frequency for the global corpus, but I
> would like to be able to get the document frequency of a term according
> only to a subset of documents (i.e. a user's collection of documents).
> 
> I guess that querying the index to get the number of hits for each term
> and for each field, filtered by a user, will be too slow.
> Any ideas?
> 
> 
> regards,
> 
> Lionel
> 
> 



Re: Document Frequency for a set of documents

Posted by lionel duboeuf <li...@boozter.com>.
Thanks Ard for your response, I found it useful.

regards.
lionel

Ard Schrijvers wrote:
> Crossposting to the user list, as I think this issue belongs there. See
> my comments inline.
>
> On Fri, Feb 5, 2010 at 10:27 AM, lionel duboeuf
> <li...@boozter.com> wrote:
>   
>> Hi,
>>
>> Sorry for asking again: I still have not found a scalable solution to get
>> the document frequency of a term t according to a set of documents. Lucene
>> only stores the document frequency for the global corpus, but I would like
>> to be able to get the document frequency of a term according only to a
>> subset of documents (i.e. a user's collection of documents).
>>
>> I guess that querying the index to get the number of hits for each term and
>> for each field, filtered by a user, will be too slow.
>> Any ideas?
>>     
>
> I have recently developed out-of-the-box faceted navigation exposed
> over JCR (Hippo repository on top of Jackrabbit), and I think you are
> essentially looking for efficient faceted navigation as well, right?
> I am also interested in whether others have something to add to my
> findings.
>
> You can approach your issue from two different angles; depending on
> the number of results versus the number of terms (unique facets), you
> can best switch between the two approaches at runtime:
>
> Approach (1): the Lucene TermEnum is leading. If the Lucene field has
> *many* (say more than 100,000) unique values, this becomes slow (and
> approach (2) might be better).
>
> You have a BitSet matchingDocs, and you want the count for every term
> of the field 'brand' that occurs in at least one of the documents in
> matchingDocs. You can then do:
>
>             // count, per term of the facet field, how many matching docs contain it
>             String facetField = "brand"; // Lucene interns field names, so == below is safe
>             TermEnum termEnum = indexReader.terms(new Term(facetField, ""));
>             try {
>                 // open termDocs only once and use seek(): this is more efficient
>                 TermDocs termDocs = indexReader.termDocs();
>                 try {
>                     do {
>                         Term term = termEnum.term();
>                         if (term == null || term.field() != facetField) { // interned comparison
>                             break; // past the last term of the facet field
>                         }
>                         int count = 0;
>                         termDocs.seek(term);
>                         while (termDocs.next()) {
>                             if (matchingDocs.get(termDocs.doc())) {
>                                 count++;
>                             }
>                         }
>                         if (count > 0 && !"".equals(term.text())) {
>                             facetValueCountMap.put(term.text(), new Count(count));
>                         }
>                     } while (termEnum.next());
>                 } finally {
>                     termDocs.close();
>                 }
>             } finally {
>                 termEnum.close();
>             }
>
> Approach (2): the matching docs are leading. All Lucene fields that
> should be usable for your facet counts must be indexed with term
> vectors. This approach becomes slow when the number of matching docs
> grows beyond 100,000 hits; in that case you are better off with
> approach (1).
>
> Create your own HitCollector, and have its collect() method do
> something like:
>
> public final void collect(final int docid, final float score) {
>     try {
>         if (facetMap != null) {
>             final TermFreqVector tfv = reader.getTermFreqVector(docid, internalName);
>             if (tfv != null) {
>                 final String[] terms = tfv.getTerms();
>                 for (int i = 0; i < terms.length; i++) {
>                     addToFacetMap(terms[i]);
>                 }
>             }
>         }
>     } catch (IOException e) {
>         throw new RuntimeException(e); // collect() cannot throw checked exceptions
>     }
> }
>
> Note that HitCollectors are not advised for large hit sets; see also [1].
>
> This is how I currently expose really performant faceted navigation
> as a JCR tree. If somebody has tried other approaches, or has
> something to add, I would be interested.
>
> Regards Ard
>
> [1] http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/HitCollector.html
>
>   
>> regards,
>>
>> Lionel
>>
>>
>>
>>
>>     




Re: Document Frequency for a set of documents

Posted by Ard Schrijvers <a....@onehippo.com>.
Crossposting to the user list, as I think this issue belongs there. See
my comments inline.

On Fri, Feb 5, 2010 at 10:27 AM, lionel duboeuf
<li...@boozter.com> wrote:
> Hi,
>
> Sorry for asking again: I still have not found a scalable solution to get
> the document frequency of a term t according to a set of documents. Lucene
> only stores the document frequency for the global corpus, but I would like
> to be able to get the document frequency of a term according only to a
> subset of documents (i.e. a user's collection of documents).
>
> I guess that querying the index to get the number of hits for each term and
> for each field, filtered by a user, will be too slow.
> Any ideas?

I have recently developed out-of-the-box faceted navigation exposed
over JCR (Hippo repository on top of Jackrabbit), and I think you are
essentially looking for efficient faceted navigation as well, right?
I am also interested in whether others have something to add to my
findings.

You can approach your issue from two different angles; depending on
the number of results versus the number of terms (unique facets), you
can best switch between the two approaches at runtime:

Approach (1): the Lucene TermEnum is leading. If the Lucene field has
*many* (say more than 100,000) unique values, this becomes slow (and
approach (2) might be better).

You have a BitSet matchingDocs, and you want the count for every term
of the field 'brand' that occurs in at least one of the documents in
matchingDocs. You can then do:

            // count, per term of the facet field, how many matching docs contain it
            String facetField = "brand"; // Lucene interns field names, so == below is safe
            TermEnum termEnum = indexReader.terms(new Term(facetField, ""));
            try {
                // open termDocs only once and use seek(): this is more efficient
                TermDocs termDocs = indexReader.termDocs();
                try {
                    do {
                        Term term = termEnum.term();
                        if (term == null || term.field() != facetField) { // interned comparison
                            break; // past the last term of the facet field
                        }
                        int count = 0;
                        termDocs.seek(term);
                        while (termDocs.next()) {
                            if (matchingDocs.get(termDocs.doc())) {
                                count++;
                            }
                        }
                        if (count > 0 && !"".equals(term.text())) {
                            facetValueCountMap.put(term.text(), new Count(count));
                        }
                    } while (termEnum.next());
                } finally {
                    termDocs.close();
                }
            } finally {
                termEnum.close();
            }
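
As a side note, the snippet above assumes you already have the
matchingDocs BitSet. One way to build it for a per-user subset (a
sketch against the Lucene 2.x API; the 'owner' field name is an
assumption) is to walk the postings of the user term:

            // mark every document owned by one user in a BitSet
            BitSet matchingDocs = new BitSet(indexReader.maxDoc());
            TermDocs userDocs = indexReader.termDocs(new Term("owner", "userA"));
            try {
                while (userDocs.next()) {
                    matchingDocs.set(userDocs.doc());
                }
            } finally {
                userDocs.close();
            }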

Approach (2): the matching docs are leading. All Lucene fields that
should be usable for your facet counts must be indexed with term
vectors. This approach becomes slow when the number of matching docs
grows beyond 100,000 hits; in that case you are better off with
approach (1).

Create your own HitCollector, and have its collect() method do
something like:

public final void collect(final int docid, final float score) {
    try {
        if (facetMap != null) {
            final TermFreqVector tfv = reader.getTermFreqVector(docid, internalName);
            if (tfv != null) {
                final String[] terms = tfv.getTerms();
                for (int i = 0; i < terms.length; i++) {
                    addToFacetMap(terms[i]);
                }
            }
        }
    } catch (IOException e) {
        throw new RuntimeException(e); // collect() cannot throw checked exceptions
    }
}
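
To give an idea of how such a collector is driven (the FacetCollector
class name and its getFacetMap() accessor are assumptions, not the
actual code):

            // Searcher.search(Query, HitCollector) pushes every matching
            // doc id through collect(), where the term vectors are counted
            FacetCollector collector = new FacetCollector(indexReader, "brand");
            searcher.search(query, collector);
            Map facetCounts = collector.getFacetMap();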


Note that HitCollectors are not advised for large hit sets; see also [1].

This is how I currently expose really performant faceted navigation
as a JCR tree. If somebody has tried other approaches, or has
something to add, I would be interested.

Regards Ard

[1] http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/HitCollector.html

>
>
> regards,
>
> Lionel
>
>
>
>

