
Posted to java-user@lucene.apache.org by Raffaella Ventaglio <r....@gmail.com> on 2009/02/07 19:57:19 UTC

Faceted search with OpenBitSet/SortedVIntList

Hi,

I am trying to implement a kind of faceted search using Lucene 2.4.0.

I have a list of configuration rules that tell me how to generate these
facets and the corresponding queries (which can range from simple term
queries to complex boolean queries).

When my application starts, it creates the whole set of facet objects and
initializes them.
For each facet:
- I create the query according to the configured rule;
- I ask the reader for the bitset corresponding to that query and I store it
in the Facet object;
- I get the cardinality of the bitset and I save it in the Facet object as
its "initial count".

When the user does a search I have to update the "counts" associated with
each Facet:
- I get the bitset corresponding to the "query + filter" generated by the
user search;
- I get the cardinality of the ("search bitset" AND "facet bitset") and I
save it as the updated count.
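
The two steps above (initial count at startup, updated count per search) can be sketched as follows. This is a minimal illustration rather than the poster's code: java.util.BitSet stands in for Lucene's OpenBitSet so the example runs without Lucene on the classpath, and the Facet class and its field names are hypothetical.

```java
import java.util.BitSet;

/** Hypothetical facet holder; java.util.BitSet is a stand-in for OpenBitSet. */
final class Facet {
    final String name;
    final BitSet docs;        // documents matching the facet's query
    final int initialCount;   // cardinality computed once at startup
    int currentCount;         // count restricted to the latest user search

    Facet(String name, BitSet docs) {
        this.name = name;
        this.docs = docs;
        this.initialCount = docs.cardinality();
        this.currentCount = this.initialCount;
    }

    /** Sets currentCount to |docs AND searchDocs| without allocating a new bitset. */
    void update(BitSet searchDocs) {
        int count = 0;
        for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
            if (searchDocs.get(doc)) {
                count++;
            }
        }
        currentCount = count;
    }
}
```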


In my first solution, I used only "OpenBitSetDISI" objects, both for the Facet
bitsets and for the search bitset.
So I could use the "intersectionCount" method to get updated counts after a
user search.

This works very well and it is very fast, but when the number of documents
in the index and the number of facets grow it consumes too much memory.


So I tried a different solution: when I create facet bitsets I use the same
rule applied in ChainedFilter/BooleanFilter to decide whether to store an
OpenBitSet or a SortedVIntList.
When I have to calculate updated counts:
- if the facet has an OpenBitSet, I use the "intersectionCount" method
directly;
- if the facet has a SortedVIntList, I first create a new OpenBitSetDISI
using the SortedVIntList.iterator and then I use the "intersectionCount"
method.
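
A rough sketch of the representation choice described above, under stated assumptions: java.util.BitSet stands in for OpenBitSet, a sorted int[] stands in for SortedVIntList, and the FacetDocSet class is hypothetical. The divisor of 9 is an assumption about the heuristic used by those contrib filters and is not stated in the post; treat it as a tunable value.

```java
import java.util.BitSet;

/** Hypothetical holder that keeps either a dense bitset or a sorted list of doc IDs. */
final class FacetDocSet {
    // Assumed threshold: go sparse when fewer than maxDoc / 9 documents match.
    static final int SPARSE_DIVISOR = 9;

    final BitSet dense;      // non-null when the facet matches many documents
    final int[] sparse;      // sorted doc IDs, non-null when it matches few
    final int initialCount;

    private FacetDocSet(BitSet dense, int[] sparse, int initialCount) {
        this.dense = dense;
        this.sparse = sparse;
        this.initialCount = initialCount;
    }

    static FacetDocSet of(BitSet matches, int maxDoc) {
        int cardinality = matches.cardinality();
        if (cardinality >= maxDoc / SPARSE_DIVISOR) {
            return new FacetDocSet(matches, null, cardinality);
        }
        // Copy the set bits into a sorted array; nextSetBit visits them in ascending order.
        int[] ids = new int[cardinality];
        int n = 0;
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            ids[n++] = doc;
        }
        return new FacetDocSet(null, ids, cardinality);
    }

    boolean isSparse() {
        return sparse != null;
    }
}
```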

In this way, I use a smaller amount of memory at initialization time, but
for each user search I create a large number of objects (that I immediately
throw away) and this hurts application performance because a lot of time is
spent doing GC.

So my question is: is there a better way to accomplish this task?

I think it would be fine if I could calculate "intersectionCount" directly
on SortedVIntList objects, but I have not found anything like that in the
Lucene 2.4 JavaDoc.
Am I missing something?
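
One way to count the intersection without building a temporary bitset is to walk the sparse list and probe the search bitset for each document ID. The sketch below is an illustration under assumptions, not a Lucene API: a sorted int[] stands in for SortedVIntList and java.util.BitSet for the search OpenBitSet, and the class name is hypothetical. With Lucene, the equivalent loop would presumably advance the list's iterator and test each document ID against the search bitset.

```java
import java.util.BitSet;

/** Hypothetical helper: intersection count of a sparse doc-ID list with a bitset. */
final class SparseIntersection {
    /** Counts the IDs in sortedDocIds whose bit is set in searchBits; allocates nothing. */
    static int intersectionCount(int[] sortedDocIds, BitSet searchBits) {
        int count = 0;
        for (int doc : sortedDocIds) {
            if (searchBits.get(doc)) {
                count++;
            }
        }
        return count;
    }
}
```

The cost is proportional to the length of the sparse list, which is small by construction since only facets with few matches are stored this way.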


As a reference, my index currently contains more than 500,000 documents and I
have to create/manage up to 50,000 facets.
Using the "second solution", at initialization time my facets structure
requires more or less 120MB (and this is good enough), while updating counts
it uses as much as 2GB of memory (and this is very bad).
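
For scale, a back-of-the-envelope estimate of what the all-bitset first solution would need with these figures. It assumes one fixed-size bitset per facet backed by 64-bit words and ignores object headers and any other overhead; the class name is hypothetical.

```java
/** Hypothetical estimate of raw bitset storage for the figures quoted above. */
final class FacetMemoryEstimate {
    /** Bytes for one bitset covering numDocs documents, rounded up to whole 64-bit words. */
    static long bitsetBytes(long numDocs) {
        long words = (numDocs + 63) / 64;
        return words * 8;
    }

    public static void main(String[] args) {
        long perFacet = bitsetBytes(500_000);   // 62504 bytes
        long total = perFacet * 50_000;         // 3125200000 bytes, roughly 3.1 GB
        System.out.println(perFacet + " bytes per facet, " + total + " bytes in total");
    }
}
```

At roughly 61 KB per facet, 50,000 dense bitsets come to about 3.1 GB, which is consistent with the first solution being too memory consuming.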

Thanks in advance,
Raf

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Sameer Maggon <ma...@gmail.com>.

Did you look at Solr? It provides faceted search out of the box and is  
built on top of Lucene.

Sameer.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 8, 2009, at 3:32 AM, Raffaella Ventaglio wrote:

> Hi Chris,
>
>> The "SortedVIntList" approach is similar to field cache. It's better to
>> use the fieldcache for the facet search, which is the "normal" approach
>> and used in tools like Solr, DBSight, Bobo Browse Engine, etc.
>
> Thanks for your answer, I did not know about FieldCache.
> However, I think I cannot use it to solve my problem because, as I said in
> my previous post, a lot of my "facets" are not related to a "value" on a
> single field, but can be configured by the user by writing a complex
> boolean query.
>
> And this is also the reason why I think I cannot use Solr to implement
> this kind of faceted search.

Solr also supports facet queries... such that a count of matching  
documents within a constrained subset is returned for each facet.query  
provided.
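
As an illustration only, a request with one facet.query parameter per configured facet could be assembled like this. The base URL, the example queries and the class name are hypothetical; facet=true, facet.query and rows are standard Solr request parameters, and rows=0 is used here on the assumption that only the counts are wanted.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

/** Hypothetical builder for a Solr select URL carrying several facet.query parameters. */
final class SolrFacetQueryUrl {
    static String build(String baseUrl, String userQuery, List<String> facetQueries) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append("?q=").append(encode(userQuery))
                .append("&rows=0")        // counts only, no documents returned
                .append("&facet=true");
        for (String facetQuery : facetQueries) {
            url.append("&facet.query=").append(encode(facetQuery));
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```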

	Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Raffaella Ventaglio <r....@gmail.com>.

Hi Chris,

> The "SortedVIntList" approach is similar to field cache. It's better to use
> the fieldcache for the facet search, which is the "normal" approach and
> used in tools like Solr, DBSight, Bobo Browse Engine, etc.


Thanks for your answer, I did not know about FieldCache.
However, I think I cannot use it to solve my problem because, as I said in
my previous post, a lot of my "facets" are not related to a "value" on a
single field, but can be configured by the user by writing a complex boolean
query.

And this is also the reason why I think I cannot use Solr to implement this
kind of faceted search.



> To avoid creating a lot of objects and quickly throwing them away, you can
> adjust Eden memory size, or you can create a bunch of objects and try to
> re-use them.
>

Our Eden memory size is already very big, but it is not sufficient and, in
any case, this solution would not be very scalable.
I was also thinking about creating a "pool" of OpenBitSet objects to reuse, but
before implementing this I wanted to check whether there was already a better
solution I was not aware of.
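
The simplest form of such a pool is a single scratch bitset that is cleared and refilled for each sparse facet. The sketch below is an assumption-laden illustration, not Lucene code: java.util.BitSet stands in for OpenBitSet, a sorted int[] for SortedVIntList, and the class name is hypothetical. An instance is not thread-safe, so concurrent searches would each need their own.

```java
import java.util.BitSet;

/** Hypothetical counter that reuses one scratch bitset instead of allocating per facet. */
final class ScratchBitSetCounter {
    private final BitSet scratch;   // allocated once, reused for every facet

    ScratchBitSetCounter(int maxDoc) {
        this.scratch = new BitSet(maxDoc);
    }

    /** Expands the sparse list into the scratch bitset, then counts the overlap. */
    int intersectionCount(int[] sortedDocIds, BitSet searchBits) {
        scratch.clear();                 // drop the previous facet's bits
        for (int doc : sortedDocIds) {
            scratch.set(doc);
        }
        scratch.and(searchBits);         // modifies only the scratch copy
        return scratch.cardinality();
    }
}
```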

Thanks,
Raf

Re: Faceted search with OpenBitSet/SortedVIntList

Posted by Chris Lu <ch...@gmail.com>.

The first approach becomes rather limiting as the number of facets grows.

The "SortedVIntList" approach is similar to field cache. It's better to use
the fieldcache for the facet search, which is the "normal" approach and used
in tools like Solr, DBSight, Bobo Browse Engine, etc.

To avoid creating a lot of objects and quickly throwing them away, you can
adjust Eden memory size, or you can create a bunch of objects and try to
re-use them.

-- 
Chris Lu
