Posted to java-user@lucene.apache.org by Antony Bowesman <ad...@teamware.com> on 2007/03/13 05:39:32 UTC

Performance between Filter and HitCollector?

There are (at least) two ways to generate a BitSet which can be used for filtering.

Filter.bits()

   BitSet bits = new BitSet(reader.maxDoc());
   TermDocs td = reader.termDocs(new Term("field", "text"));
   while (td.next())
   {
       bits.set(td.doc());
   }
   return bits;

and HitCollector.collect(), as suggested in Javadocs

    final BitSet bits = new BitSet(indexReader.maxDoc());
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
          bits.set(doc);
        }
      });

Solr seems to use DocSetHitCollector in places, which allows the DocSet
interface to be used rather than a plain old BitSet, so that small sets can be
optimised. Does anyone know the performance implications of using HitCollector,
if the score is not required, against using Filter and then generating a DocSet?

Antony






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Performance between Filter and HitCollector?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Mar 15, 2007, at 12:27 AM, Antony Bowesman wrote:
> Thanks for the detailed response Hoss.  That's the sort of in-depth
> golden nugget I'd like to see in a copy of LIA 2 when it becomes
> available...

NOTED!   :)

	Erik



Re: Performance between Filter and HitCollector?

Posted by Antony Bowesman <ad...@teamware.com>.

Thanks for the detailed response Hoss.  That's the sort of in-depth golden nugget
I'd like to see in a copy of LIA 2 when it becomes available...

I've wanted to use Filter to cache certain of my Term Queries, as it looked 
faster for straight Term Query searches, but Solr's DocSet interface abstraction 
is more useful.  HashDocSet will probably satisfy 90% of my cache.

Index DBs will typically be in the 1-3 million document range, but that is mail
spread over 1-6K users, so caching lots of BitSets for that number of users is
not practical!
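
To put a rough number on that, here is a back-of-envelope sketch. It assumes
one dense bit set per user covering the whole index, uses the upper ends of
the ranges above (3 million documents, 6K users), and ignores object headers.
The class and method names are made up for illustration.

```java
// Rough memory estimate for caching one dense bit set per user.
// Assumption: one bit per document, rounded up to whole 64-bit words,
// which is how java.util.BitSet stores its bits.
final class BitSetCacheCost {

    /** Bytes of bit storage for one dense bit set over maxDoc documents. */
    static long bytesPerBitSet(long maxDoc) {
        long words = (maxDoc + 63) / 64; // round up to whole 64-bit words
        return words * 8;
    }

    /** Bytes of bit storage for cachedSets such bit sets. */
    static long bytesForCache(long maxDoc, long cachedSets) {
        return bytesPerBitSet(maxDoc) * cachedSets;
    }

    public static void main(String[] args) {
        long maxDoc = 3_000_000L; // upper end of the index size range
        long users = 6_000L;      // upper end of the user count range
        System.out.println(bytesPerBitSet(maxDoc));       // prints 375000
        System.out.println(bytesForCache(maxDoc, users)); // prints 2250000000
    }
}
```

So roughly 375 KB per cached set and about 2.25 GB if every user had one,
regardless of how few documents each set actually matches.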

I ended up creating a DocSetFilter that builds a DocSet (a la Solr) from the
BitSet and caches that.  I then convert it back to a BitSet during
Filter.bits().  Not the best solution, but the typical hit size is small, so
the iteration is fast.
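
The round trip looks roughly like the sketch below. This is not Solr's DocSet
or HashDocSet and not my actual DocSetFilter; SmallDocSet is a hypothetical
stand-in that keeps the matching doc ids in a sorted int array, and it only
depends on java.util.BitSet so it can be read without Lucene on the classpath.

```java
import java.util.BitSet;

// Hypothetical compact doc set: stores only the matching doc ids,
// so memory is proportional to the number of hits, not to maxDoc.
final class SmallDocSet {
    private final int[] docs; // matching doc ids in ascending order

    private SmallDocSet(int[] docs) {
        this.docs = docs;
    }

    /** Build the compact form from a BitSet, e.g. one produced by a Filter. */
    static SmallDocSet fromBitSet(BitSet bits) {
        int[] docs = new int[bits.cardinality()];
        int i = 0;
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            docs[i++] = doc;
        }
        return new SmallDocSet(docs);
    }

    /** Expand back to a BitSet sized for maxDoc, as Filter.bits() must return. */
    BitSet toBitSet(int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : docs) {
            bits.set(doc);
        }
        return bits;
    }

    int size() {
        return docs.length;
    }
}
```

The cached object costs 4 bytes per hit, and the cost of toBitSet() is one
BitSet allocation plus one pass over the hits per call.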

Thanks, eks dev, for the info about LUCENE-584 - that looks like an interesting
set of patches.

Antony

Chris Hostetter wrote:
> It's kind of an apples/oranges comparison ... in the examples you gave
> below, one is executing an arbitrary query (which could be anything), the
> other is doing a simple TermEnumeration.
> 
> Assuming that Query is a TermQuery, the Filter is theoretically going to be
> faster because it doesn't have to compute any scores ... generally speaking,
> a Filter will always be a little faster than a functionally equivalent
> Query for the purposes of building up a simple BitSet of matching
> documents, because the Query involves the score calculations ... but the
> Query is generally more usable.
> 
> The Query can also be more efficient in other ways, because the
> HitCollector doesn't *have* to build a BitSet, it can deal with the
> results in whatever way it wants (whereas a Filter always generates a
> BitSet).
> 
> Solr goes the HitCollector route for a few reasons:
>   1) allows us to use the DocSet abstraction, which allows other
>      performance benefits over straight BitSets
>   2) allows us to have simpler code that builds DocSets and DocLists
>      (DocLists know about scores, sorting, and pagination) in a single
>      pass when scores or sorting are requested.


Re: Performance between Filter and HitCollector?

Posted by Chris Hostetter <ho...@fucit.org>.

It's kind of an apples/oranges comparison ... in the examples you gave
below, one is executing an arbitrary query (which could be anything), the
other is doing a simple TermEnumeration.

Assuming that Query is a TermQuery, the Filter is theoretically going to be
faster because it doesn't have to compute any scores ... generally speaking,
a Filter will always be a little faster than a functionally equivalent
Query for the purposes of building up a simple BitSet of matching
documents, because the Query involves the score calculations ... but the
Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet, it can deal with the
results in whatever way it wants (whereas a Filter always generates a
BitSet).
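
As a rough illustration of that last point (a sketch only, not Solr's
DocSetHitCollector): the class below has a collect(int doc, float score)
method with the same shape as HitCollector.collect(), but it does not extend
the Lucene class so that it stands alone. IntListCollector is a made-up name.
It ignores the score and appends doc ids to a growable int array instead of
setting bits in a maxDoc-sized BitSet.

```java
import java.util.Arrays;

// Hypothetical collector that stores hits as a plain list of doc ids.
// In real code this logic would sit inside a HitCollector subclass.
final class IntListCollector {
    private int[] docs = new int[16]; // grows as hits arrive
    private int size = 0;

    /** Same shape as HitCollector.collect(); the score is simply ignored. */
    public void collect(int doc, float score) {
        if (size == docs.length) {
            docs = Arrays.copyOf(docs, size * 2); // double the capacity
        }
        docs[size++] = doc;
    }

    /** Collected doc ids, in the order they were reported. */
    int[] docs() {
        return Arrays.copyOf(docs, size);
    }
}
```

For a query with a handful of hits this allocates a few dozen bytes, whatever
the size of the index.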

Solr goes the HitCollector route for a few reasons:
  1) allows us to use the DocSet abstraction, which allows other
     performance benefits over straight BitSets
  2) allows us to have simpler code that builds DocSets and DocLists
     (DocLists know about scores, sorting, and pagination) in a single
     pass when scores or sorting are requested.



-Hoss

