Posted to java-user@lucene.apache.org by James Pine <ge...@yahoo.com> on 2006/07/05 20:07:19 UTC

BitSet in a HitCollector

Hey Everyone,

I'm using a HitCollector and would like to know the
total number of results that matched a given query.
Based on the JavaDoc, I think this will do the trick:

Searcher searcher = new IndexSearcher(indexReader);
final BitSet bits = new BitSet(indexReader.maxDoc());
searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        bits.set(doc);
    }
});
int numResults = bits.cardinality();

If I want to know the total number of results inside
of the HitCollector, i.e. before the collect method
has ever been called, I think I could pass the Query
and Searcher objects into the HitCollector and do this
in its constructor:

BitSet bits = (new QueryFilter(query)).bits(searcher.getIndexReader());
int numResults = bits.cardinality();

Is there a performance penalty using the QueryFilter?
Is Lucene executing another pass over the index in
order to populate the BitSet and then doing another
pass while calling the collect method? Thanx.

JAMES



__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BitSet in a HitCollector

Posted by Tricia Williams <pg...@student.cs.uwaterloo.ca>.
Hi James,

    A paper mentioned on this list in the last couple of months presents a
solution to your sampling problem that does not require knowing the total
result size in advance.  The paper (http://www2005.org/cdrom/docs/p245.pdf)
presents two solutions, both of which use a random variable.  In the first,
you traverse the result set and select each document with probability p,
where p is chosen in advance.  Alternatively, the paper describes an
algorithm (bottom of page 248) for determining a skip value which, while
similar to the traversal, lets you jump over documents and avoid the
per-document probability computation that the first solution requires.
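
As a rough illustration of the two ideas, here is a minimal Java sketch. It
is not taken from the paper: the class name SamplingDecider and its methods
are hypothetical, and the skip value is drawn from a geometric distribution,
which is one standard way to get the same selection probability p without a
random draw per document. The paper's exact algorithm may differ.

```java
import java.util.Random;

// Hypothetical helper, not part of Lucene and not the paper's code.
class SamplingDecider {
    private final double p;   // selection probability, chosen in advance
    private final Random rng;
    private long toSkip;      // documents still to skip before the next pick

    SamplingDecider(double p, Random rng) {
        if (p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in (0, 1]");
        }
        this.p = p;
        this.rng = rng;
        this.toSkip = nextSkip();
    }

    // First approach: one random draw for every document.
    boolean acceptPerDocument() {
        return rng.nextDouble() < p;
    }

    // Second approach: draw a skip length once per selected document,
    // then count documents down without touching the random generator.
    boolean acceptWithSkip() {
        if (toSkip > 0) {
            toSkip--;
            return false;
        }
        toSkip = nextSkip();
        return true;
    }

    // Number of rejected documents before the next accepted one:
    // geometrically distributed, P(skip = k) = (1 - p)^k * p.
    private long nextSkip() {
        if (p >= 1.0) {
            return 0;
        }
        double u = 1.0 - rng.nextDouble(); // uniform in (0, 1]
        return (long) Math.floor(Math.log(u) / Math.log(1.0 - p));
    }
}
```

Inside a HitCollector, the collect method could call one of the accept
methods once per hit and only load the document when it returns true.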

    I hope this helps!

Tricia

On Thu, 6 Jul 2006, James Pine wrote:

> [...]



Re: BitSet in a HitCollector

Posted by James Pine <ge...@yahoo.com>.
Hey,

Sorry, I will explain a bit more about my collect
method. Currently my collect method is executing
IndexSearcher.doc(id) and storing some stuff in a Map
which I can then retrieve from the HitCollector (much
like the example in the Lucene In Action book). Of
course that's somewhat expensive, so I'd like to do
some statistical sampling based on the result set size
to try and speed things up.

The way I was thinking about doing this was, during
the collect method only executing
IndexSearcher.doc(id) on every Nth document, where N
is calculated dynamically based on a minimum number X.
The rule would be:

N = Max(1,(numResults / X))

In order to do this in the collect method, I need to
know the total number of results before ever invoking
the collect method, right? That seemed to make a case
for the BitSet/QueryFilter in the constructor.

In addition, someone else on the list mentioned that
one of the reasons calling IndexSearcher.doc(id) in
the collect method is slow is that it causes the disk
to do a lot of seeking. Maybe that's a moot point if
one is using a RAMDirectory or an FSDirectory small
enough that it gets cached by the OS anyway, but if
it's not, then I thought it might be faster to have
the HitCollector set the bits in the collect method
and then do another pass to do the statistical
sampling.
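
The post-collect pass described above could look roughly like the sketch
below. It uses only java.util.BitSet, so no Lucene calls are shown. The
class EveryNthSampler and the names stride, sample and minSamples are
hypothetical, and reading X as a minimum sample size is an assumption.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper: picks every Nth matching doc id from a BitSet
// that a collector has already filled in.
class EveryNthSampler {

    // N = Max(1, numResults / X), using integer division.
    static int stride(int numResults, int minSamples) {
        if (minSamples <= 0) {
            throw new IllegalArgumentException("minSamples must be positive");
        }
        return Math.max(1, numResults / minSamples);
    }

    // Walks the set bits in increasing doc id order and keeps every Nth one.
    static List<Integer> sample(BitSet matches, int minSamples) {
        int n = stride(matches.cardinality(), minSamples);
        List<Integer> sampled = new ArrayList<>();
        int seen = 0;
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            if (seen % n == 0) {
                sampled.add(doc);
            }
            seen++;
        }
        return sampled;
    }
}
```

Only the ids returned by sample would then be passed to
IndexSearcher.doc(id). Because they come back in increasing order, the
stored-field reads move forward through the index, which may reduce
seeking, though that would need to be measured.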

Either way it seems that to do the statistical
sampling that I envision I either need to calculate
the total result count/document id set in the
constructor, before calling the collect method, or
calculate the total result count/document id set in
the collect method and then execute some sort of
post-collect method, right? So I was just wondering
which method was better/faster. Thanx.

JAMES





Re: BitSet in a HitCollector

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm using a HitCollector and would like to know the
: total number of results that matched a given query.
: Based on the JavaDoc, I think this will do the trick:

you don't need a BitSet in that case, you could find that out just using
an int...

    public class CountingCollector extends HitCollector {
      public int count = 0;
      public void collect(int doc, float score) { count++; }
    }
    CountingCollector c = new CountingCollector();
    searcher.search(query, c);
    int numResults = c.count;

: If I want to know the total number of results inside
: of the HitCollector, i.e. before the collect method
: has ever been called, I think I could pass the Query
: and Searcher objects into the HitCollector and do this
: in its constructor:
:
: BitSet bits = (new QueryFilter(query)).bits(searcher.getIndexReader());
: int numResults = bits.cardinality();

This question doesn't make a lot of sense to me: why do you need to know
the total number of results before the collect method is called? what
you are suggesting here (using QueryFilter in this way) is perfectly
legal, but it's going to do just as much work as using a HitCollector will
(possibly more, i can't remember).

: Is Lucene executing another pass over the index in
: order to populate the BitSet and then doing another
: pass while calling the collect method? Thanx.

in your last example, you never use your HitCollector, so i'm not sure what
you mean, but assuming you are asking about combining those examples into
something like this....

  Searcher searcher = new IndexSearcher(indexReader);
  BitSet bits = (new QueryFilter(query)).bits(searcher.getIndexReader());
  final int numResults = bits.cardinality();
  searcher.search(query, new HitCollector() {
       public void collect(int doc, float score) {
          /* do something with numResults and doc and score */
       }
  });

...then yes, you are most definitely making two passes to do that.



-Hoss

