You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Raffaella Ventaglio <r....@gmail.com> on 2009/02/07 19:57:19 UTC

Faceted search with OpenBitSet/SortedVIntList

Hi,

I am trying to implement a kind of faceted search using Lucene 2.4.0.

I have a list of configuration rules that tell me how to generate this
facets and the corresponding queries (that can range from simple term
queries to complex boolean queries).

When my application starts, it creates the whole set of facets objects and
initializes them.
For each facet:
- I create the query according to the configured rule;
- I ask the reader for the bitset corresponding to that query and I store it
in the Facet object;
- I get the cardinality of the bitset and I save it in the Facet object as
its "initial count".

When the user does a search I have to update the "counts" associated to each
Facet:
- I get the bitset corresponding to the "query + filter" generated by the
user search;
- I get the cardinality of the ("search bitset" AND "facet bitset") and I
save it as the updated count.


In my first solution, I used only "OpenBitSetDISI" objects, both for Facet
bitset and for search bitset.
So I could use "intersectionCount" method to get updated counts after user
search.

This works very well and it is very fast, but when the number of documents
in the index and the number of facets grows it is too memory consuming.


So I tried a different solution: when I create facet bitsets I use the same
rule applied in ChainedFilter/BooleanFilter to decide if I have to store an
OpenBitSet or a SortedVIntList.
When I have to calculate updated counts:
- if the facet has an OpenBitSet, I use the "intersectionCount" method
directly;
- if the facet has a SortedVIntList, I first create a new OpenBitSetDISI
using the SortedVIntList.iterator and then I use the "intersectionCount"
method.

In this way, I use a smaller amount of memory at initialization time, but
for each user search I create a large number of objects (that I suddenly
throw away) and this affects application performance because it wastes a lot
of time doing GC.

So my question is: is there a better way to accomplish this task?

I think, it would be fine if I could calculate "intersectionCount" directly
on SortedVIntList objects, but I have not found nothing like that in Lucene
2.4 JavaDoc.
Am I missing something?


As a reference, now my index contains more than 500.000 documents and I have
to create/manage up to 50.000 facets.
Using "second solution", at initialization time my facets structure requires
more or less 120MB (and this is good enough), while updating counts it uses
even 2GB of memory (and this is very bad).

Thanks in advance,
Raf

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Sameer Maggon <ma...@gmail.com>.
Did you look at Solr? It provides faceted search out of the box and is  
built on top of Lucene.

Sameer.

On Feb 7, 2009, at 10:57 AM, Raffaella Ventaglio  
<r....@gmail.com> wrote:

> Hi,
>
> I am trying to implement a kind of faceted search using Lucene 2.4.0.
>
> I have a list of configuration rules that tell me how to generate this
> facets and the corresponding queries (that can range from simple term
> queries to complex boolean queries).
>
> When my application starts, it creates the whole set of facets  
> objects and
> initializes them.
> For each facet:
> - I create the query according to the configured rule;
> - I ask the reader for the bitset corresponding to that query and I  
> store it
> in the Facet object;
> - I get the cardinality of the bitset and I save it in the Facet  
> object as
> its "initial count".
>
> When the user does a search I have to update the "counts" associated  
> to each
> Facet:
> - I get the bitset corresponding to the "query + filter" generated  
> by the
> user search;
> - I get the cardinality of the ("search bitset" AND "facet bitset")  
> and I
> save it as the updated count.
>
>
> In my first solution, I used only "OpenBitSetDISI" objects, both for  
> Facet
> bitset and for search bitset.
> So I could use "intersectionCount" method to get updated counts  
> after user
> search.
>
> This works very well and it is very fast, but when the number of  
> documents
> in the index and the number of facets grows it is too memory  
> consuming.
>
>
> So I tried a different solution: when I create facet bitsets I use  
> the same
> rule applied in ChainedFilter/BooleanFilter to decide if I have to  
> store an
> OpenBitSet or a SortedVIntList.
> When I have to calculate updated counts:
> - if the facet has an OpenBitSet, I use the "intersectionCount" method
> directly;
> - if the facet has a SortedVIntList, I first create a new  
> OpenBitSetDISI
> using the SortedVIntList.iterator and then I use the  
> "intersectionCount"
> method.
>
> In this way, I use a smaller amount of memory at initialization  
> time, but
> for each user search I create a large number of objects (that I  
> suddenly
> throw away) and this affects application performance because it  
> wastes a lot
> of time doing GC.
>
> So my question is: is there a better way to accomplish this task?
>
> I think, it would be fine if I could calculate "intersectionCount"  
> directly
> on SortedVIntList objects, but I have not found nothing like that in  
> Lucene
> 2.4 JavaDoc.
> Am I missing something?
>
>
> As a reference, now my index contains more than 500.000 documents  
> and I have
> to create/manage up to 50.000 facets.
> Using "second solution", at initialization time my facets structure  
> requires
> more or less 120MB (and this is good enough), while updating counts  
> it uses
> even 2GB of memory (and this is very bad).
>
> Thanks in advance,
> Raf

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 8, 2009, at 3:32 AM, Raffaella Ventaglio wrote:

> Hi Chris,
>
> The "SortedVIntList" approach is similar to field cache. It's better  
> to use
>> the fieldcache for the facet search, which is the "normal" approach  
>> and
>> used
>> in tools like Solr, DBSight, Bobo Browse Engine, etc.
>
>
> Thanks for your answer, I did not know about FieldCache.
> However, I think I cannot use it to solve my problem because, as I  
> said in
> my previous post, a lot of my "facets" are not related to a "value"  
> on a
> single field, but can be configured by the user by writing a complex  
> boolean
> query.

> And this is also the reason why I think I cannot use Solr to  
> implement this kind of faceted search.

Solr also supports facet queries... such that a count of matching  
documents within a constrained subset is returned for each facet.query  
provided.

	Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Raffaella Ventaglio <r....@gmail.com>.
Hi Chris,

The "SortedVIntList" approach is similar to field cache. It's better to use
> the fieldcache for the facet search, which is the "normal" approach and
> used
> in tools like Solr, DBSight, Bobo Browse Engine, etc.


Thanks for your answer, I did not know about FieldCache.
However, I think I cannot use it to solve my problem because, as I said in
my previous post, a lot of my "facets" are not related to a "value" on a
single field, but can be configured by the user by writing a complex boolean
query.

And this is also the reason why I think I cannot use Solr to implement this
kind of faceted search.



> To avoid creating a lot of objects and quickly throwing them away, you can
> adjust Eden memory size, or you can create a bunch of objects and try to
> re-use them.
>

Our Eden memory size is already very big, but it is not sufficient and, in
any case, this solution would not be very scalable.
I was also thinking about creating a "pool" of OpenBitSet to reuse, but
before to implement this I thought to look if there were already a better
solution I was not aware of.

Thanks,
Raf

>
>

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Chris Lu <ch...@gmail.com>.
The first approach is rather limiting when facets number grows.

The "SortedVIntList" approach is similar to field cache. It's better to use
the fieldcache for the facet search, which is the "normal" approach and used
in tools like Solr, DBSight, Bobo Browse Engine, etc.

To avoid creating a lot of objects and quickly throwing them away, you can
adjust Eden memory size, or you can create a bunch of objects and try to
re-use them.

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Sat, Feb 7, 2009 at 10:57 AM, Raffaella Ventaglio
<r....@gmail.com>wrote:

> Hi,
>
> I am trying to implement a kind of faceted search using Lucene 2.4.0.
>
> I have a list of configuration rules that tell me how to generate this
> facets and the corresponding queries (that can range from simple term
> queries to complex boolean queries).
>
> When my application starts, it creates the whole set of facets objects and
> initializes them.
> For each facet:
> - I create the query according to the configured rule;
> - I ask the reader for the bitset corresponding to that query and I store
> it
> in the Facet object;
> - I get the cardinality of the bitset and I save it in the Facet object as
> its "initial count".
>
> When the user does a search I have to update the "counts" associated to
> each
> Facet:
> - I get the bitset corresponding to the "query + filter" generated by the
> user search;
> - I get the cardinality of the ("search bitset" AND "facet bitset") and I
> save it as the updated count.
>
>
> In my first solution, I used only "OpenBitSetDISI" objects, both for Facet
> bitset and for search bitset.
> So I could use "intersectionCount" method to get updated counts after user
> search.
>
> This works very well and it is very fast, but when the number of documents
> in the index and the number of facets grows it is too memory consuming.
>
>
> So I tried a different solution: when I create facet bitsets I use the same
> rule applied in ChainedFilter/BooleanFilter to decide if I have to store an
> OpenBitSet or a SortedVIntList.
> When I have to calculate updated counts:
> - if the facet has an OpenBitSet, I use the "intersectionCount" method
> directly;
> - if the facet has a SortedVIntList, I first create a new OpenBitSetDISI
> using the SortedVIntList.iterator and then I use the "intersectionCount"
> method.
>
> In this way, I use a smaller amount of memory at initialization time, but
> for each user search I create a large number of objects (that I suddenly
> throw away) and this affects application performance because it wastes a
> lot
> of time doing GC.
>
> So my question is: is there a better way to accomplish this task?
>
> I think, it would be fine if I could calculate "intersectionCount" directly
> on SortedVIntList objects, but I have not found nothing like that in Lucene
> 2.4 JavaDoc.
> Am I missing something?
>
>
> As a reference, now my index contains more than 500.000 documents and I
> have
> to create/manage up to 50.000 facets.
> Using "second solution", at initialization time my facets structure
> requires
> more or less 120MB (and this is good enough), while updating counts it uses
> even 2GB of memory (and this is very bad).
>
> Thanks in advance,
> Raf
>

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by John Wang <jo...@gmail.com>.
Hi Paul:

     Took a little time to find these classes, they are post-2.4 additions I
think.

     Our representation:

1) We do not build on top of the FieldCacheImpl class for a few reasons: It
is keyed off of the IndexReader, and application cannot clear the cache, and
we assume indexes can update often, using FieldCacheImpl is not ideal. Also,
to support faceted search, we need to store the value array, and the only
thing available in FieldCacheImp is StringIndex. While this is nice and all,
if data is of primitive types, e.g. int, short etc. this is wasteful.
Another reason, though minor, is that we discovered with high doc count
indexes, the order array, keyed from 0 to maxdoc, for each field we want to
support, can cause problems with VM gc. We instead use a segmented index
array representation. Lastly, our data structure is loaded at index load
time, not lazily like FieldCacheImpl.

2) So our forward index representation is an order array, a value array (of
primitive type if specified), and some helper arrays which I will describe
later in detail.

3) For range queries, the idea is very similar to FieldCacheRangeFilter,
except for our primitive value store, and make use of our helper array, e.g.
min/max docid, e.g. for each value, we store corresponding min/max values,
so we know to skip to starting doc and stop early at maxdoc.

4) The differences between FieldCacheTermFilter are similar. Except that we
cased out when termList contains 1 element. We felt that was neccessary due
to the number of times next and skip are called when recall is large, the
price to pay for bitset.get() is too high.

5) We also consider the cases when indexed values are tokenized, e.g. 1 doc
can have more than 1 value in a field. And that representation is a more
involved than can be explained in an email. Comparing to 3), we need to
consider the case of AND in addition to OR which can be assumed when only 1
value per doc is allowed.

6) We also support hierarchical data.

I just merged to our trunk, feel free to take a look.

-John

On Sun, Feb 8, 2009 at 2:40 AM, Paul Elschot <pa...@xs4all.nl> wrote:

> John,
>
> On Sunday 08 February 2009 00:35:10 John Wang wrote:
> > Our implementation of facet search can handle this. Using bitsets for
> > intersection is not scalable performance wise when index is large.
> >
> > We are using a compact forwarded index representation in memory for the
> > counting.
>
> Could you describe how this compact forwarded index works?
>
> > Similar to FieldCache idea but more compact.
>
> Does this also use FieldCacheRangeFilter and/or FieldCacheTermsFilter?
>
>
> Regards,
> Paul Elschot
>

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Paul Elschot <pa...@xs4all.nl>.
John,

On Sunday 08 February 2009 00:35:10 John Wang wrote:
> Our implementation of facet search can handle this. Using bitsets for
> intersection is not scalable performance wise when index is large.
> 
> We are using a compact forwarded index representation in memory for the
> counting.

Could you describe how this compact forwarded index works?

> Similar to FieldCache idea but more compact.

Does this also use FieldCacheRangeFilter and/or FieldCacheTermsFilter?


Regards,
Paul Elschot

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by John Wang <jo...@gmail.com>.
Our implementation of facet search can handle this. Using bitsets for
intersection is not scalable performance wise when index is large.

We are using a compact forwarded index representation in memory for the
counting. Similar to FieldCache idea but more compact.

Check it out at: http://sourceforge.net/projects/bobo-browse/
Use the branch: BR_DEV_1_5_0
It is rather stable now, we are releasing it soon.

We are using this on production for LinkedIn with live traffic and it is
working well.

Test it out to see if your performance requirement is met and let us know
how to improve.

Thanks

-John

On Sat, Feb 7, 2009 at 2:06 PM, Paul Elschot <pa...@xs4all.nl> wrote:

> On Saturday 07 February 2009 19:57:19 Raffaella Ventaglio wrote:
> > Hi,
> >
> > I am trying to implement a kind of faceted search using Lucene 2.4.0.
> >
> > I have a list of configuration rules that tell me how to generate this
> > facets and the corresponding queries (that can range from simple term
> > queries to complex boolean queries).
> >
> > When my application starts, it creates the whole set of facets objects
> and
> > initializes them.
> > For each facet:
> > - I create the query according to the configured rule;
> > - I ask the reader for the bitset corresponding to that query and I store
> it
> > in the Facet object;
> > - I get the cardinality of the bitset and I save it in the Facet object
> as
> > its "initial count".
> >
> > When the user does a search I have to update the "counts" associated to
> each
> > Facet:
> > - I get the bitset corresponding to the "query + filter" generated by the
> > user search;
> > - I get the cardinality of the ("search bitset" AND "facet bitset") and I
> > save it as the updated count.
> >
> >
> > In my first solution, I used only "OpenBitSetDISI" objects, both for
> Facet
> > bitset and for search bitset.
> > So I could use "intersectionCount" method to get updated counts after
> user
> > search.
> >
> > This works very well and it is very fast, but when the number of
> documents
> > in the index and the number of facets grows it is too memory consuming.
> >
> >
> > So I tried a different solution: when I create facet bitsets I use the
> same
> > rule applied in ChainedFilter/BooleanFilter to decide if I have to store
> an
> > OpenBitSet or a SortedVIntList.
> > When I have to calculate updated counts:
> > - if the facet has an OpenBitSet, I use the "intersectionCount" method
> > directly;
> > - if the facet has a SortedVIntList, I first create a new OpenBitSetDISI
> > using the SortedVIntList.iterator and then I use the "intersectionCount"
> > method.
> >
> > In this way, I use a smaller amount of memory at initialization time, but
> > for each user search I create a large number of objects (that I suddenly
> > throw away) and this affects application performance because it wastes a
> lot
> > of time doing GC.
> >
> > So my question is: is there a better way to accomplish this task?
> >
> > I think, it would be fine if I could calculate "intersectionCount"
> directly
> > on SortedVIntList objects, but I have not found nothing like that in
> Lucene
> > 2.4 JavaDoc.
> > Am I missing something?
>
> You are not missing anything.
>
> OpenBitSet has an optimized implementation for intersection count, and
> there is no counterpart of that in SortedVIntList because until now there
> has been no need for it.
>
> One way to implement that would be to use one of the boolean combination
> filters in contrib, BooleanFilter or ChainedFilter,  and simply count the
> the number of times next() returns true on the result.
>
> In case the performance of that is not good enough, another way would be
> to directly add an intersection count method to SortedVIntList.
> However, SortedVIntList does not allow for an efficient iterator
> implementation of skipTo(), and skipTo() is used intensively by
> intersections.
>
> > As a reference, now my index contains more than 500.000 documents and I
> have
> > to create/manage up to 50.000 facets.
> > Using "second solution", at initialization time my facets structure
> requires
> > more or less 120MB (and this is good enough), while updating counts it
> uses
> > even 2GB of memory (and this is very bad).
>
> 50.000 facets? Well, in case the performance of the last suggestion is
> not good enough, one could try and implement a better data structure
> than OpenBitSet and SortedVIntList to provide a DocIdSetIterator,
> preferably with a fast skipTo() and possibly with a fast intersection
> count.
> In that case, you may want to ask further on the java-dev list.
>
> Regards,
> Paul Elschot
>

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Paul Elschot <pa...@xs4all.nl>.
On Sunday 08 February 2009 09:53:00 Uwe Schindler wrote:
> I would do so, it's really simple, you can even do it in an anonymous inner
> class.

It is indeed simple, but it might also help to take a look at the source code
of the Lucene classes involved.

Regards,
Paul Elschot

> 
> -----
> UWE SCHINDLER
> Webserver/Middleware Development
> PANGAEA - Publishing Network for Geoscientific and Environmental Data
> MARUM - University of Bremen
> Room 2500, Leobener Str., D-28359 Bremen
> Tel.: +49 421 218 65595
> Fax:  +49 421 218 65505
> http://www.pangaea.de/
> E-mail: uschindler@pangaea.de
> 
> > -----Original Message-----
> > From: Raffaella Ventaglio [mailto:r.ventaglio@gmail.com]
> > Sent: Sunday, February 08, 2009 9:47 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Faceted search with OpenBitSet/SortedVIntList
> > 
> > Hi Paul,
> > 
> > One way to implement that would be to use one of the boolean combination
> > > filters in contrib, BooleanFilter or ChainedFilter,  and simply count
> > the
> > > the number of times next() returns true on the result.
> > 
> > 
> > I am sorry, but I cannot understand: how can I create a BooleanFilter or a
> > ChainedFilter starting from two SortedVIntList objects?
> > I have not found any filter that takes an existing "DocIdSet" in its
> > constructor...
> > 
> > However I have seen that Filter interface is very easy to implement.
> > Should I create a custom Filter that wraps my SortedVIntList and than use
> > these filters to create a BooleanFilter?
> > 
> > Thanks,
> > Raf
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 

RE: Faceted search with OpenBitSet/SortedVIntList

Posted by Uwe Schindler <us...@pangaea.de>.
I would do so, it's really simple, you can even do it in an anonymous inner
class.

-----
UWE SCHINDLER
Webserver/Middleware Development
PANGAEA - Publishing Network for Geoscientific and Environmental Data
MARUM - University of Bremen
Room 2500, Leobener Str., D-28359 Bremen
Tel.: +49 421 218 65595
Fax:  +49 421 218 65505
http://www.pangaea.de/
E-mail: uschindler@pangaea.de

> -----Original Message-----
> From: Raffaella Ventaglio [mailto:r.ventaglio@gmail.com]
> Sent: Sunday, February 08, 2009 9:47 AM
> To: java-user@lucene.apache.org
> Subject: Re: Faceted search with OpenBitSet/SortedVIntList
> 
> Hi Paul,
> 
> One way to implement that would be to use one of the boolean combination
> > filters in contrib, BooleanFilter or ChainedFilter,  and simply count
> the
> > the number of times next() returns true on the result.
> 
> 
> I am sorry, but I cannot understand: how can I create a BooleanFilter or a
> ChainedFilter starting from two SortedVIntList objects?
> I have not found any filter that takes an existing "DocIdSet" in its
> constructor...
> 
> However I have seen that Filter interface is very easy to implement.
> Should I create a custom Filter that wraps my SortedVIntList and than use
> these filters to create a BooleanFilter?
> 
> Thanks,
> Raf


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 17 February 2009 10:12:12 Raffaella Ventaglio wrote:
> Thanks for sharing this info.
> In any case, this is not a problem for me since I have used only the "idea"
> to choose between OpenBitSet and SortedViIntList from contrib BooleanFilter,
> but I have then implemented it in my own facets manager structure, so I do
> not use the "removed" finalResult method.

It would be possible to build a similar choice criterion between
OpenBitSet and SortedVIntList into CachingWrapperFilter to chose
the data structure to be used for caching.

For example when using the same criterion as in the removed methods
there, your original problem might not have occurred at all.

In the CachingWrapperFilter in trunk the choice is left to an overridable method.

Regards,
Paul Elschot


> 
> Regards,
> Raf
> 
> On Sun, Feb 15, 2009 at 2:39 PM, Paul Elschot <pa...@xs4all.nl>wrote:
> 
> >
> > Meanwhile the choice between SortedVIntList and OpenBitSet
> > has been removed from the trunk (development version),
> > that now uses OpenBitSet only:
> > https://issues.apache.org/jira/browse/LUCENE-1296
> >
> > In case there is preference to have SortedVIntList used in the
> > next lucene version (i.e. in cases when it is smaller  than
> > OpenBitSet), please comment at LUCENE-1296.
> >
> > Regards,
> > Paul Elschot
> >
> >
> >
> 

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Raffaella Ventaglio <r....@gmail.com>.
Thanks for sharing this info.
In any case, this is not a problem for me since I have used only the "idea"
to choose between OpenBitSet and SortedViIntList from contrib BooleanFilter,
but I have then implemented it in my own facets manager structure, so I do
not use the "removed" finalResult method.

Regards,
Raf

On Sun, Feb 15, 2009 at 2:39 PM, Paul Elschot <pa...@xs4all.nl>wrote:

>
> Meanwhile the choice between SortedVIntList and OpenBitSet
> has been removed from the trunk (development version),
> that now uses OpenBitSet only:
> https://issues.apache.org/jira/browse/LUCENE-1296
>
> In case there is preference to have SortedVIntList used in the
> next lucene version (i.e. in cases when it is smaller  than
> OpenBitSet), please comment at LUCENE-1296.
>
> Regards,
> Paul Elschot
>
>
>

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Paul Elschot <pa...@xs4all.nl>.
Meanwhile the choice between SortedVIntList and OpenBitSet
has been removed from the trunk (development version),
that now uses OpenBitSet only:
https://issues.apache.org/jira/browse/LUCENE-1296

In case there is preference to have SortedVIntList used in the
next lucene version (i.e. in cases when it is smaller  than
OpenBitSet), please comment at LUCENE-1296.

Regards,
Paul Elschot



On Sunday 08 February 2009 09:47:24 Raffaella Ventaglio wrote:
> Hi Paul,
> 
> One way to implement that would be to use one of the boolean combination
> > filters in contrib, BooleanFilter or ChainedFilter,  and simply count the
> > the number of times next() returns true on the result.
> 
> 
> I am sorry, but I cannot understand: how can I create a BooleanFilter or a
> ChainedFilter starting from two SortedVIntList objects?
> I have not found any filter that takes an existing "DocIdSet" in its
> constructor...
> 
> However I have seen that Filter interface is very easy to implement.
> Should I create a custom Filter that wraps my SortedVIntList and than use
> these filters to create a BooleanFilter?
> 
> Thanks,
> Raf
> 

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Raffaella Ventaglio <r....@gmail.com>.
Hi Paul,

One way to implement that would be to use one of the boolean combination
> filters in contrib, BooleanFilter or ChainedFilter,  and simply count the
> the number of times next() returns true on the result.


I am sorry, but I cannot understand: how can I create a BooleanFilter or a
ChainedFilter starting from two SortedVIntList objects?
I have not found any filter that takes an existing "DocIdSet" in its
constructor...

However I have seen that Filter interface is very easy to implement.
Should I create a custom Filter that wraps my SortedVIntList and than use
these filters to create a BooleanFilter?

Thanks,
Raf

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Paul Elschot <pa...@xs4all.nl>.
On Saturday 07 February 2009 19:57:19 Raffaella Ventaglio wrote:
> Hi,
> 
> I am trying to implement a kind of faceted search using Lucene 2.4.0.
> 
> I have a list of configuration rules that tell me how to generate this
> facets and the corresponding queries (that can range from simple term
> queries to complex boolean queries).
> 
> When my application starts, it creates the whole set of facets objects and
> initializes them.
> For each facet:
> - I create the query according to the configured rule;
> - I ask the reader for the bitset corresponding to that query and I store it
> in the Facet object;
> - I get the cardinality of the bitset and I save it in the Facet object as
> its "initial count".
> 
> When the user does a search I have to update the "counts" associated to each
> Facet:
> - I get the bitset corresponding to the "query + filter" generated by the
> user search;
> - I get the cardinality of the ("search bitset" AND "facet bitset") and I
> save it as the updated count.
> 
> 
> In my first solution, I used only "OpenBitSetDISI" objects, both for Facet
> bitset and for search bitset.
> So I could use "intersectionCount" method to get updated counts after user
> search.
> 
> This works very well and it is very fast, but when the number of documents
> in the index and the number of facets grows it is too memory consuming.
> 
> 
> So I tried a different solution: when I create facet bitsets I use the same
> rule applied in ChainedFilter/BooleanFilter to decide if I have to store an
> OpenBitSet or a SortedVIntList.
> When I have to calculate updated counts:
> - if the facet has an OpenBitSet, I use the "intersectionCount" method
> directly;
> - if the facet has a SortedVIntList, I first create a new OpenBitSetDISI
> using the SortedVIntList.iterator and then I use the "intersectionCount"
> method.
> 
> In this way, I use a smaller amount of memory at initialization time, but
> for each user search I create a large number of objects (that I suddenly
> throw away) and this affects application performance because it wastes a lot
> of time doing GC.
> 
> So my question is: is there a better way to accomplish this task?
> 
> I think, it would be fine if I could calculate "intersectionCount" directly
> on SortedVIntList objects, but I have not found nothing like that in Lucene
> 2.4 JavaDoc.
> Am I missing something?

You are not missing anything.

OpenBitSet has an optimized implementation for intersection count, and
there is no counterpart of that in SortedVIntList because until now there
has been no need for it.

One way to implement that would be to use one of the boolean combination
filters in contrib, BooleanFilter or ChainedFilter,  and simply count the
the number of times next() returns true on the result.

In case the performance of that is not good enough, another way would be
to directly add an intersection count method to SortedVIntList.
However, SortedVIntList does not allow for an efficient iterator
implementation of skipTo(), and skipTo() is used intensively by intersections.
 
> As a reference, now my index contains more than 500.000 documents and I have
> to create/manage up to 50.000 facets.
> Using "second solution", at initialization time my facets structure requires
> more or less 120MB (and this is good enough), while updating counts it uses
> even 2GB of memory (and this is very bad).

50.000 facets? Well, in case the performance of the last suggestion is
not good enough, one could try and implement a better data structure
than OpenBitSet and SortedVIntList to provide a DocIdSetIterator,
preferably with a fast skipTo() and possibly with a fast intersection count.
In that case, you may want to ask further on the java-dev list.

Regards,
Paul Elschot