You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by "J.J. Larrea" <jj...@panix.com> on 2006/12/06 20:52:48 UTC

Multi-Valued Faceting

Many thanks to Hoss, Yonik, et al. for their excellent efforts to 
bring faceted browsing to the masses!  In most cases it works great!

But for some fields I have a need which is unfortunately not filled 
by the current faceting code.  These fields (Author Name, for 
example) have too many discrete Term values to be handled by the 
cached-filter mechanism for facet counting, but require counting 
multiple Terms per document and so are not handled by the alternate 
facet mechanism based on FieldCache.  So I think I need to dive into 
the SOLR code, but I first wanted to check to make sure nobody else 
is working on something like this, and secondly to get feedback on 
the best implementation approach.

(As background to those not familiar with the internals, for a 
single-valued non-tokenized non-Boolean field the current 
SimpleFacets implementation uses a FieldCache borrowed from the 
Lucene sorting mechanism to quickly retrieve the indexed Term value 
for each Document; otherwise a series of queries cached as bitmaps 
are done for each possible Term value, ANDed with a hitlist bitmap, 
and the resulting bits counted)

My thought was that the simplest approach would be to subclass 
FieldCacheImpl to introduce a getMultiStringIndex method derived from 
getStringIndex, defining  and returning a MultiStringIndex class 
which stores order as int[][] rather than int[]; a variant of 
SimpleFacets.getFieldCacheCounts would simply need an inner loop to 
tally each of the Document's Term indexes for that field.

With multi-valuedness no longer being a useful criterion for 
automatically choosing between the filter-based and modified 
FieldCache-based mechanisms, there then would need to be an alternate 
criterion, either implicit or explicit. Does anyone have any ideas 
how best to do that?  For example, is there a way to quickly 
determine the number of distinct Term values for a field without 
enumerating to the end, so the ratio of Terms to Documents can be 
used?

An entirely alternate approach (briefly suggested in a comment in 
SimpleFacets) for fields indexed with term vectors would be to simply 
call getTermFreqVector, for each hit and store (term text, tally) in 
a HashTable, or (term text, index) in a HT which could be cached with 
tallies generated per-query in an array as they are now, in the 
latter case building a field-cache dynamically based on actual query 
results.  Does anyone have any insight on how efficient that may or 
may not be?

And if I have gotten something dreadfully wrong in my understanding 
of current implementation or proposed enhancement, I would appreciate 
getting straightened out.

Thanks,
J.J. Larrea

Re: Multi-Valued Faceting

Posted by Chris Hostetter <ho...@fucit.org>.
: > which stores order as int[][] rather than int[]; a variant of
: > SimpleFacets.getFieldCacheCounts would simply need an inner loop to
: > tally each of the Document's Term indexes for that field.
:
: I think something like that is the right approach, the only problem
: being the size in memory this would take up.  It may need some clever
: encoding to keep it reasonable.

yeah ... at some point you may wnat to consider the possibility
of "sampling" the values for the first N documents in your DocList and
then getting the counts for those values ... it's not a perfect solution,
but it should work efficiently no matter how many docs you have, or how
many unique field values there are in your index -- i don't know of any
other approach that can function as well for any data set.

: > how best to do that?  For example, is there a way to quickly
: > determine the number of distinct Term values for a field without
: > enumerating to the end, so the ratio of Terms to Documents can be
: > used?
:
: I'd suggest a Solr fieldInfo cache that stored info about a field:
: a) the number of unique terms in the field
: b) (optionally) a sorted list by docfreq of the top terms in the field

yeah ... this is what i'd orriginally invinisioned when i first wrote the
TermEnum based code in SimpleFacets before Yonik pointed out how usefull
the FieldCache could be ... keeping a list like yonik described in (b)
where the size of the list is sufficiently bigger them the typical "limit"
you put on your facet fields should provide a lot of wins -- if the way i
picture it in my head works out in reality, sizing that list with your
<HashDocSet maxSize="X"/> in mind might help you ensure that even if you
do wind up iterating over the full TermEnum, those terms all result in
HashDocSets which will be relatively small, so your filterCache can be
big.

computing the (a) value yonik mentioned is trivial to do while building up
(b) and (a) can be used to determine wether or not you really want to try
and build the MultiFieldCache or just walk the TermEnums.



-Hoss


Re: Multi-Valued Faceting

Posted by Yonik Seeley <yo...@apache.org>.
On 12/6/06, J.J. Larrea <jj...@panix.com> wrote:
> My thought was that the simplest approach would be to subclass
> FieldCacheImpl to introduce a getMultiStringIndex method derived from
> getStringIndex, defining  and returning a MultiStringIndex class
> which stores order as int[][] rather than int[]; a variant of
> SimpleFacets.getFieldCacheCounts would simply need an inner loop to
> tally each of the Document's Term indexes for that field.

I think something like that is the right approach, the only problem
being the size in memory this would take up.  It may need some clever
encoding to keep it reasonable.

> With multi-valuedness no longer being a useful criterion for
> automatically choosing between the filter-based and modified
> FieldCache-based mechanisms, there then would need to be an alternate
> criterion, either implicit or explicit. Does anyone have any ideas
> how best to do that?  For example, is there a way to quickly
> determine the number of distinct Term values for a field without
> enumerating to the end, so the ratio of Terms to Documents can be
> used?

I'd suggest a Solr fieldInfo cache that stored info about a field:
a) the number of unique terms in the field
b) (optionally) a sorted list by docfreq of the top terms in the field

> An entirely alternate approach (briefly suggested in a comment in
> SimpleFacets) for fields indexed with term vectors would be to simply
> call getTermFreqVector, for each hit and store (term text, tally) in
> a HashTable, or (term text, index) in a HT which could be cached with
> tallies generated per-query in an array as they are now, in the
> latter case building a field-cache dynamically based on actual query
> results.  Does anyone have any insight on how efficient that may or
> may not be?

For queries that don't have many hits, termvectors would be fine.
I don't think they would perform well with a lot of hits though.
There could be a different type of faceting that just uses the top "n"
results though.

> And if I have gotten something dreadfully wrong in my understanding
> of current implementation or proposed enhancement, I would appreciate
> getting straightened out.

Sounds like you have a pretty good handle on it!

-Yonik