
Posted to solr-user@lucene.apache.org by "Vish D." <vi...@gmail.com> on 2006/06/23 15:33:21 UTC

Faceted Browsing questions

Hi all,

I am trying to figure out how I can have some type of faceted browsing
working. I also need a way to get a list of unique field values
within a query's result set (for filtering, etc.). When I say trying, I
mean having it up and running without much coding, because of time
constraints. I would most definitely be involved in some customizing just
because of the nature of the data I am working with. I have searched
through the mailing list and seen some posts mentioning BitSets, DocSets,
etc., but it wasn't clear whether those are already built into Solr's
nightly builds (I don't see any documentation on the wiki or online).
Can someone please steer me in the right direction to get the above up
and running in a short time?

Thanks a lot!

Vish

Re: Faceted Browsing questions

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jun 29, 2006, at 12:30 AM, Vish D. wrote:
> Any update on your progress? Eager to get my hands on your
> latest code...
> :=)

It's all in our Subversion repository:

	<http://sourceforge.net/projects/patacriticism>

Sorry I didn't announce it, but we do have a patacriticism-development
e-mail list that you can subscribe to for commit messages.

I've got a dual object-type facet cache going now.  TermQuery objects are
cached for most facets, which keeps them quite lightweight, and all of
our faceted fields currently fit nicely in RAM.  However, I'm also
layering one level of "relationship" between different document _types_
in Solr: a cross-reference of tags->objects and usernames->objects.  The
objects are the basic document type (type:A, for "archive") in Solr, and
type:C documents are the folksonomical glue between a user and an object,
currently supporting tagging and annotations.  These relationship "facets"
are currently DocSet caches, but they fit into the same cache, so the
front-end can constrain the search space by agent (aka author/artist/etc.),
genre, archive, year, users, and tags as if they were all the same sort of
thing.

We're currently having some sysadmin folks set us up with a production
environment to run this thing.  All the pieces are in our repository to
bootstrap the Collex system by launching two command lines, one for RoR
and one for Solr via Jetty.  There is also an Ant build file with an
"index" target that indexes a directory full of (our custom flavor of)
RDF into Solr.  No instructions for bootstrapping it all are available
yet, but feel free to tinker if you like.  :)

	Erik

Re: Faceted Browsing questions

Posted by "Vish D." <vi...@gmail.com>.

Erik,

Any update on your progress? Eager to get my hands on your latest code...
:=)

Thanks!


On 6/28/06, Chris Hostetter <ho...@fucit.org> wrote:
>
> : > well, the most obvious solution i can think of would be a patch adding
> an
> : > invert() method to DocSet, HashDocSet and BitDocSet.   :)
> :
> : Yes, we could add a flip(startIndex, endIndex) or flip(maxDoc)
>
> Yeah ... like i suggested before just make maxDoc an intrinsic property
> that DocSets know when they are created.
>
> Another issue though is one of performance ... inverting a HashDocSet with
> only a few docs should probably produce a BitDocSet -- ideally using the
> same configured maxSize and loadFactor information that the
> SolrIndexSearcher uses ... perhaps the method for inverting/flipping a
> DocSet should live in SolrIndexSearcher?
>
>
> -Hoss
>
>

Re: Faceted Browsing questions

Posted by Chris Hostetter <ho...@fucit.org>.

: > well, the most obvious solution i can think of would be a patch adding an
: > invert() method to DocSet, HashDocSet and BitDocSet.   :)
:
: Yes, we could add a flip(startIndex, endIndex) or flip(maxDoc)

Yeah ... like i suggested before just make maxDoc an intrinsic property
that DocSets know when they are created.

Another issue though is one of performance ... inverting a HashDocSet with
only a few docs should probably produce a BitDocSet -- ideally using the
same configured maxSize and loadFactor information that the
SolrIndexSearcher uses ... perhaps the method for inverting/flipping a
DocSet should live in SolrIndexSearcher?
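
A minimal sketch of the "maxDoc as an intrinsic property" idea, for illustration only. It uses java.util.BitSet as a stand-in; BoundedDocSet and its methods are hypothetical names, not Solr's DocSet API. Because the set remembers maxDoc from construction, invert() needs no arguments, and the result is always bit-set backed, which matches the observation that inverting a tiny set yields a very large one.

```java
import java.util.BitSet;

// Hypothetical stand-in for a doc set that remembers maxDoc.
// This is NOT Solr's DocSet; it only illustrates the idea.
final class BoundedDocSet {
    private final BitSet bits;
    private final int maxDoc; // one past the largest valid document id

    BoundedDocSet(int maxDoc, int... docIds) {
        this.maxDoc = maxDoc;
        this.bits = new BitSet(maxDoc);
        for (int id : docIds) {
            if (id < 0 || id >= maxDoc) {
                throw new IllegalArgumentException("doc id out of range: " + id);
            }
            bits.set(id);
        }
    }

    private BoundedDocSet(int maxDoc, BitSet bits) {
        this.maxDoc = maxDoc;
        this.bits = bits;
    }

    // Returns a new set holding every id in [0, maxDoc) that is absent here.
    // The original set is left untouched.
    BoundedDocSet invert() {
        BitSet flipped = (BitSet) bits.clone();
        flipped.flip(0, maxDoc);
        return new BoundedDocSet(maxDoc, flipped);
    }

    int size() {
        return bits.cardinality();
    }

    boolean contains(int docId) {
        return bits.get(docId);
    }
}
```

With maxDoc = 10 and two ids in the set, invert() returns a set of the other eight ids. A real implementation would also have to decide what to do about deleted documents, which this sketch ignores.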


-Hoss

Re: Faceted Browsing questions

Posted by Yonik Seeley <ys...@gmail.com>.

On 6/26/06, Chris Hostetter <ho...@fucit.org> wrote:
> : My next challenge is to re-implement the catch-all facets that I used
> : to do by unioning all documents in an (Open)BitSet and inverting it.
> : How can I invert a DocSet (I realize I can get the bits and do it
> : that way, but is there a better way)?
>
> well, the most obvious solution i can think of would be a patch adding an
> invert() method to DocSet, HashDocSet and BitDocSet.   :)

Yes, we could add a flip(startIndex, endIndex) or flip(maxDoc)

If the inverted set is just for the purposes of getting an
intersection count with another set, you could also just do
mySet.andNotSize(theBigUnionSet)
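
To illustrate why no inversion is needed for a count, here is a small sketch using java.util.BitSet as a stand-in for a DocSet. The class and helper names are made up for the example; only the andNotSize idea comes from the suggestion above.

```java
import java.util.BitSet;

// Illustrative only: BitSet stands in for a doc set.
final class CatchAllCount {
    // Counts the docs in `results` that appear in none of the facet sets,
    // where `bigUnion` is the union of all facet sets. The inverted union
    // is never built.
    static int andNotSize(BitSet results, BitSet bigUnion) {
        BitSet remainder = (BitSet) results.clone(); // keep the caller's set intact
        remainder.andNot(bigUnion);
        return remainder.cardinality();
    }

    // Convenience helper for building small example sets.
    static BitSet of(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }
}
```

For results {0,1,2,3,4} and a union of {1,3,7}, the count is 3 (docs 0, 2 and 4), the same number an intersection with the inverted union would give.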

-Yonik

Re: Faceted Browsing questions

Posted by Chris Hostetter <ho...@fucit.org>.

: It may not even be necessary to cache this type of lookup since it is
: simply a TermEnum through specific fields in the index.  Maybe simply
: doing the TermEnum in the request handler instead of iterating
: through a cache would be just as fast or faster.  Any thoughts on that?

While commuting I've been letting my brain bounce around various ideas
for a completely generic, totally reusable faceting request handler, and
I've been mulling over the same question ... my current theory is that it
might make sense to cache a bounded priority queue of the Terms for each
faceting field, where the priority is determined by the docFreq and the
size is configurable.  That way you can start with the values in the
queue, and if/when you reach a point where the docFreq of the next item in
the queue is less than the lowest intersection count you've found so far,
and you already have as many items as you want to display, you don't have
to bother checking any of the other values (and you don't have to bother
with the TermEnum unless you completely exhaust the queue).
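
The early-termination idea can be sketched roughly as follows. This is an illustration under stated assumptions, not a Solr implementation: BitSet stands in for a DocSet, the cached queue is represented as a list already sorted by docFreq descending, and TopFacetSketch, TermEntry and FacetCount are hypothetical names. The fallback to a TermEnum when the queue is exhausted is left out.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopFacetSketch {
    // One cached term: its value, the docs containing it, and its docFreq.
    static final class TermEntry {
        final String value;
        final BitSet docs;
        final int docFreq;

        TermEntry(String value, BitSet docs) {
            this.value = value;
            this.docs = docs;
            this.docFreq = docs.cardinality();
        }
    }

    static final class FacetCount {
        final String value;
        final int count;

        FacetCount(String value, int count) {
            this.value = value;
            this.count = count;
        }
    }

    // `byDocFreqDesc` must already be ordered by docFreq, highest first.
    static List<FacetCount> topFacets(List<TermEntry> byDocFreqDesc, BitSet results, int wanted) {
        if (wanted <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on count, so the weakest facet kept so far sits at the head.
        PriorityQueue<FacetCount> best =
                new PriorityQueue<>(Comparator.comparingInt((FacetCount f) -> f.count));
        for (TermEntry term : byDocFreqDesc) {
            // An intersection count can never exceed docFreq, so once docFreq
            // drops below the weakest kept count, no later term can qualify.
            if (best.size() == wanted && term.docFreq < best.peek().count) {
                break;
            }
            BitSet overlap = (BitSet) term.docs.clone();
            overlap.and(results);
            int count = overlap.cardinality();
            if (count == 0) {
                continue;
            }
            best.add(new FacetCount(term.value, count));
            if (best.size() > wanted) {
                best.poll(); // drop the weakest to stay within `wanted`
            }
        }
        List<FacetCount> out = new ArrayList<>(best);
        out.sort(Comparator.comparingInt((FacetCount f) -> f.count).reversed());
        return out;
    }

    // Convenience helper for building small example sets.
    static BitSet bits(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }
}
```

The break relies only on the fact that a term's intersection with the result set cannot be larger than its docFreq, which is why ordering the cached terms by docFreq lets the loop stop early.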

: My next challenge is to re-implement the catch-all facets that I used
: to do by unioning all documents in an (Open)BitSet and inverting it.
: How can I invert a DocSet (I realize I can get the bits and do it
: that way, but is there a better way)?

well, the most obvious solution i can think of would be a patch adding an
invert() method to DocSet, HashDocSet and BitDocSet.   :)

there was some discussion about this on the list previously if i recall
correctly.


-Hoss

Re: Faceted Browsing questions

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jun 24, 2006, at 4:29 PM, Yonik Seeley wrote:
> On 6/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>> This weekend :)   I have imported more data than my hacked
>> implementation can handle without bumping up Jetty's JVM heap size,
>> so I'm now at the point where it is necessary for me to start using
>> the LRUCache.  Though I have already refactored to use OpenBitSet
>> instead of BitSet.
>
> You can also fit more in mem if you can use DocSet (HashDocSet) for
> smaller sets.  This will also speed up intersection counts.  This is
> done automatically when you get the DocSet from Solr, or if numDocs()
> is used.

Thanks for this advice, Yonik.   I've refactored (but not committed  
yet, for those that may be looking to see what I've done) the  
caching.  The cache (currently a single HashMap) is keyed by field
name, with nested HashMaps keyed by field value.  The inner map used
to contain BitSets, then OpenBitSets, but now it contains only
TermQuery objects.  Now I simply use SolrIndexSearcher.getDocSet(query)
and rely on the existing query caching.  The only thing my custom
cache puts into RAM now is this HashMap of all faceted fields, values,
and associated TermQuery objects.  At some point that might even
become an issue, but maybe not.
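
The shape of that cache can be sketched with plain collections. This is a rough illustration, not the actual project code: FacetQueryCache is a made-up name, and TermKey is a stand-in for Lucene's TermQuery that only records the (field, value) pair. In the real handler each entry would be handed to SolrIndexSearcher.getDocSet(query), so the doc sets themselves live in Solr's own query cache.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class FacetQueryCache {
    // Stand-in for a TermQuery: just the field and value, no doc ids.
    static final class TermKey {
        final String field;
        final String value;

        TermKey(String field, String value) {
            this.field = field;
            this.value = value;
        }

        @Override
        public String toString() {
            return field + ":" + value;
        }
    }

    // Outer map keyed by field name, inner map keyed by field value.
    private final Map<String, Map<String, TermKey>> byField = new LinkedHashMap<>();

    // Registers one facet value; repeated values are stored only once.
    void add(String field, String value) {
        byField.computeIfAbsent(field, f -> new LinkedHashMap<>())
               .putIfAbsent(value, new TermKey(field, value));
    }

    // Read-only view of all known values for a field (empty if unknown).
    Map<String, TermKey> valuesFor(String field) {
        Map<String, TermKey> values = byField.getOrDefault(field, Collections.emptyMap());
        return Collections.unmodifiableMap(values);
    }
}
```

The point of the design is that the custom cache holds only lightweight keys, so its memory cost grows with the number of distinct facet values rather than with the number of documents.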

It may not even be necessary to cache this type of lookup since it is  
simply a TermEnum through specific fields in the index.  Maybe simply  
doing the TermEnum in the request handler instead of iterating  
through a cache would be just as fast or faster.  Any thoughts on that?

Either way, at the moment things are screaming fast and memory is  
pleasantly under control.

My next challenge is to re-implement the catch-all facets that I used  
to do by unioning all documents in an (Open)BitSet and inverting it.   
How can I invert a DocSet (I realize I can get the bits and do it
that way, but is there a better way)?

	Erik

Re: Faceted Browsing questions

Posted by Yonik Seeley <ys...@gmail.com>.

On 6/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> This weekend :)   I have imported more data than my hacked
> implementation can handle without bumping up Jetty's JVM heap size,
> so I'm now at the point where it is necessary for me to start using
> the LRUCache.  Though I have already refactored to use OpenBitSet
> instead of BitSet.

You can also fit more in mem if you can use DocSet (HashDocSet) for
smaller sets.  This will also speed up intersection counts.  This is
done automatically when you get the DocSet from Solr, or if numDocs()
is used.
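
A rough sketch of why a compact representation helps for small sets. This is not Solr's HashDocSet; it uses a plain int array for the small set and java.util.BitSet for the large one, and the memory estimate (4 bytes per id versus one bit per document) is an illustrative assumption, not Solr's actual sizing rule.

```java
import java.util.BitSet;

final class SmallSetSketch {
    // Counts how many ids from a small set are also in a large bit set.
    // The cost grows with the small set's size, not with maxDoc.
    static int intersectionSize(int[] smallSet, BitSet largeSet) {
        int count = 0;
        for (int id : smallSet) {
            if (largeSet.get(id)) {
                count++;
            }
        }
        return count;
    }

    // Back-of-the-envelope comparison: an int array costs about 4 bytes per
    // id, while a bit set costs about one bit per document in the index.
    static boolean smallFormIsCheaper(int setSize, int maxDoc) {
        long arrayBytes = 4L * setSize;
        long bitSetBytes = (maxDoc + 7L) / 8L;
        return arrayBytes < bitSetBytes;
    }
}
```

Under this estimate, a three-document set in a one-million-document index takes about 12 bytes as an array versus roughly 125,000 bytes as a bit set, while a 100,000-document set is cheaper as a bit set.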

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: Faceted Browsing questions

Posted by "Vish D." <vi...@gmail.com>.

Erik,

Oh good! Keep me (us) updated!!

As for committing some code into Solr, and the real world uses, I am sure we
can find some generic/abstract rules for faceted browsing -- simplest being,
a set of fields/categories defined in schema.xml, which could be used for an
optional extended query response, or a custom/new response by itself.

I am also sure that we have at least a couple of other implementations of this
feature, which might bring some good insights into "better" use of code. In
any case, I am eager to see this feature "ironed" out on the community
level.

Thanks!


On 6/24/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
>
> On Jun 24, 2006, at 12:38 PM, Vish D. wrote:
> > Erik, when do you plan on having your implementation refactored
> > with "good"
> > use of code?
>
> This weekend :)   I have imported more data than my hacked
> implementation can handle without bumping up Jetty's JVM heap size,
> so I'm now at the point where it is necessary for me to start using
> the LRUCache.  Though I have already refactored to use OpenBitSet
> instead of BitSet.
>
> > Or, in general, when is Solr planning on having this feature
> > out (as I see it on the wiki for near term features)? It might be
> > better for
> > me to wait and see how the group decides to implement it, rather
> > than having
> > something done myself and have to drop it at the end. Plus, you guys
> probably have the upper hand when it comes to knowing the details of
> > Solr/Lucene, and its re-useable features.
>
> The best way for Solr to get this functionality is for those that
> have implemented it in a custom fashion to get together and
> generalize it, so that we have a proven architecture that is
> configurable enough to handle real world situations.  My
> implementation is still being ironed out.  And it does rely on custom
> request handlers to utilize the facets and return back the counts per
> facet.
>
>         Erik
>
>
>

Re: Faceted Browsing questions

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jun 24, 2006, at 12:38 PM, Vish D. wrote:
> Erik, when do you plan on having your implementation refactored  
> with "good"
> use of code?

This weekend :)   I have imported more data than my hacked  
implementation can handle without bumping up Jetty's JVM heap size,  
so I'm now at the point where it is necessary for me to start using  
the LRUCache.  Though I have already refactored to use OpenBitSet  
instead of BitSet.

> Or, in general, when is Solr planning on having this feature
> out (as I see it on the wiki for near term features)? It might be  
> better for
> me to wait and see how the group decides to implement it, rather  
> than having
> something done myself and have to drop it at the end. Plus, you guys
> probably have the upper hand when it comes to knowing the details of
> Solr/Lucene, and its re-useable features.

The best way for Solr to get this functionality is for those that  
have implemented it in a custom fashion to get together and  
generalize it, so that we have a proven architecture that is  
configurable enough to handle real world situations.  My  
implementation is still being ironed out.  And it does rely on custom  
request handlers to utilize the facets and return back the counts per  
facet.

	Erik

Re: Faceted Browsing questions

Posted by "Vish D." <vi...@gmail.com>.

Thank you Chris and Erik. That makes it a bit clearer, but I might need to
sit down and look at the code (nines + DisMax...) a bit closer to see how it
all works in Solr.

Erik, when do you plan on having your implementation refactored with "good"
use of code? Or, in general, when is Solr planning on having this feature
out (as I see it on the wiki for near term features)? It might be better for
me to wait and see how the group decides to implement it, rather than having
something done myself and have to drop it at the end. Plus, you guys
probably have the upper hand when it comes to knowing the details of
Solr/Lucene, and its re-useable features.

Thanks all, and just wanted to say -- I am quite impressed by how Solr is
being taken on by the community. It's a solid search API, if it fits your
needs.

On 6/23/06, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> : nature of the data I am working with. I have searched through the
> mailing
> : list and seen some posts mentioning BitSets DocSets, etc.., but wasn't
> clear
> : on if those are already built into the solr's nightly builds (I don't
> see
> : any documentation either on the wiki, or online). Can someone please steer
> me
> : towards the right direction to have the above up in the short time?
>
> You'll want to start with the Solr javadocs, which are linked to from the
> left nav of every page on the Solr website ("Documentation > API Docs")...
>
>         http://incubator.apache.org/solr/docs/api/
>
> The DocSet classes are in fact a core part of Solr.
>
> There are some examples in email threads where Erik sent out some code
> demonstrating how he was doing faceting using BitSets, and I suggested
> ways he could do things using DocSets ... another good example you can
> look at is the code for the DisMaxRequestHandler.  It doesn't do faceting,
> but it does use DocSets when dealing with the "fq" (filter query) param.
>
> That should be a good place to start.
>
>
> -Hoss
>
>

Re: Faceted Browsing questions

Posted by Chris Hostetter <ho...@fucit.org>.

: nature of the data I am working with. I have searched through the mailing
: list and seen some posts mentioning BitSets DocSets, etc.., but wasn't clear
: on if those are already built into the solr's nightly builds (I don't see
: any documentation either on the wiki, or online). Can someone please steer me
: towards the right direction to have the above up in the short time?

You'll want to start with the Solr javadocs, which are linked to from the
left nav of every page on the Solr website ("Documentation > API Docs")...

	http://incubator.apache.org/solr/docs/api/

The DocSet classes are in fact a core part of Solr.

There are some examples in email threads where Erik sent out some code
demonstrating how he was doing faceting using BitSets, and I suggested
ways he could do things using DocSets ... another good example you can
look at is the code for the DisMaxRequestHandler.  It doesn't do faceting,
but it does use DocSets when dealing with the "fq" (filter query) param.

That should be a good place to start.


-Hoss

Re: Faceted Browsing questions

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

I'm extremely time constrained at the moment, but I'll reply  
briefly.  Solr provides the ground work for making faceted features  
possible, but out of the box it does not provide it without coding a  
custom request handler and knowing a little about Lucene and Solr's  
APIs.  As you've seen, bits and pieces have been posted to the list.   
My project is open-source at the "patacriticism" project at  
SourceForge, under the "nines" folder in Subversion.  Feel free to  
have a peek there, but it's certainly going to change dramatically
soon to take better advantage of Solr's caching infrastructure - so  
take it as a (bad) example for now.

	Erik

On Jun 23, 2006, at 9:33 AM, Vish D. wrote:

> Hi all,
>
> I am trying to figure out how I can have some type of faceted browsing
> working. I am also in need of a way to get a list of unique field  
> values
> within a query's results set (for filtering, etc...). When I say  
> trying, I
> mean having it up and running without much coding, b/c of time  
> reasons. I
> would most definitely be involved in some customizing just because  
> of the
> nature of the data I am working with. I have searched through the  
> mailing
> list and seen some posts mentioning BitSets DocSets, etc.., but  
> wasn't clear
> on if those are already built into the solr's nightly builds (I  
> don't see
> any documentation either on the wiki, or online). Can someone please
> steer me
> towards the right direction to have the above up in the short time?
>
> Thanks a lot!
>
> Vish