Posted to solr-user@lucene.apache.org by Pieter Berkel <pi...@gmail.com> on 2007/11/12 00:59:56 UTC

Faceting over limited result set

I'm trying to obtain faceting information based on the first 'x' (let's say
100-500) results matching a given (dismax) query.  The actual documents
matching the query are not important in this case, so intuitively the
simplest approach I can think of would be to limit the result set to 'x'
documents.

Unfortunately I can't find any easy way to limit the number of documents
matched (and returned in the set).  It might be possible to achieve the
desired result by using a function query + filter query, however that seems
a bit hack-ish, and hopefully I've missed something basic that leads to a
simpler solution.

Apologies if this has already been discussed / solved before.

Thanks,
Piete

Re: Faceting over limited result set

Posted by Chris Hostetter <ho...@fucit.org>.
: It's not really a performance-related issue, the primary goal is to use the
: facet information to determine the most relevant product category related to
: the particular search being performed.

ah ... ok, i understand now.  the order does matter, you want the "top N" 
documents sorted by some criteria (either score, or maybe popularity i 
would imagine) and then you want to pick the categories based on that.

i had to build this for CNET back before solr went open source, but yes - 
i did it using a custom subclass of dismax similar to what i 
described before.

one thing to watch out for is that you probably want to use a consistent 
sort independent of the user's sort -- if the user re-sorts by price it 
can be disconcerting for them if that changes the navigation links.


-Hoss


Re: Faceting over limited result set

Posted by Pieter Berkel <pi...@gmail.com>.
On 13/11/2007, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> can you elaborate on your use case ... the only time i've ever seen people
> ask about something like this it was because true facet counts were too
> expensive to compute, so they were doing "sampling" of the first N
> results.
>
> In Solr, sampling like this would likely be just as expensive as getting
> the full count.


It's not really a performance-related issue, the primary goal is to use the
facet information to determine the most relevant product category related to
the particular search being performed.

Generally the facets returned by simple, generic queries are fine for this
purpose (e.g. a search for "nokia" will correctly return "Mobile / Cell
Phone" as the most frequent facet), however facet data for more specific
searches are not as clear-cut (e.g. "samsung tv" where TVs will appear at
the top of the search results, but will also match other "samsung' products
like mobile phones and mp3 players - obviously I could tweak 'mm' parameter
to fix this particular case, but it wouldn't really solve my problem).

The theory is that facet information generated from the first 'x' (let's say
100) matches to a query (ordered by score / relevance) will be more accurate
(for the above purpose) than facets obtained over the entire result set.  So
ideally, it would be useful to be able to constrain the size of the DocSet
somehow (as you mention below).
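The intuition can be sketched outside Solr entirely.  In the plain-Java toy below (class name, category labels, and result counts are all invented for illustration), facet counts over the top-N relevance-ordered hits pick a different leading category than counts over the full match set:

```java
import java.util.*;

public class TopNFacets {
    // Count occurrences of each category among the first n entries of a
    // relevance-ordered list of per-hit category values.
    static Map<String, Integer> facetTopN(List<String> categoriesByRank, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String cat : categoriesByRank.subList(0, Math.min(n, categoriesByRank.size()))) {
            counts.merge(cat, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical "samsung tv" results: TVs rank first on relevance,
        // but the full match set is dominated by other samsung products.
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < 80; i++) hits.add("TV");
        for (int i = 0; i < 300; i++) hits.add("Mobile Phone");
        for (int i = 0; i < 150; i++) hits.add("MP3 Player");

        // Over the top 100 hits the leading facet is "TV";
        // over the full set it is "Mobile Phone".
        System.out.println("top 100:  " + facetTopN(hits, 100));
        System.out.println("full set: " + facetTopN(hits, hits.size()));
    }
}
```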


> matching occurs in increasing order of docid, so even if there was a hook
> to say "stop matching after N docs" those N wouldn't be a good
> representative sample, they would be biased towards "older" documents
> (based on when they were indexed, not on any particular date field)
>
> if what you are interested in is stats on the first N docs according to a
> specific sort (score or otherwise) then you could write a custom request
> handler that executed a search with a limit of N, got the DocList,
> iterated over it to build a DocSet, and then used that DocSet to do
> faceting ... but that would probably take even longer than just using the
> full DocSet matching the entire query.



I was hoping to avoid having to write a custom request handler but your
suggestion above sounds like it would do the trick.  I'm also debating
whether to extract my own facet info from a result set on the client side,
but this would be even slower.

Thanks for your suggestions so far,
Piete

Re: Faceting over limited result set

Posted by Mike Klaas <mi...@gmail.com>.
On 13-Nov-07, at 4:44 PM, Pieter Berkel wrote:

> On Nov 14, 2007 6:44 AM, Mike Klaas <mi...@gmail.com> wrote:
>
> Thanks Mike, that looks like a good place to start.  While I really
> can't think of any practical use for limiting the size of DocSet other
> than simple faceting, the new search component architecture makes it a
> little more difficult to confine any implementation to only the facet
> component (unless there is an efficient way to obtain a subset of a
> DocSet, which there doesn't seem to be).

DocSets (so far) are unordered so I don't see how that would work.

> I'm also aware of the query
> caching issues arising from SolrIndexSearcher however if N is
> sufficiently low this (hopefully) shouldn't be too much of a problem.
>
> I can't find either the SearcherUtils class nor any reference to a
> getDocSetFromDocList() method in svn trunk, is this deprecated or
> custom-built code?

Custom.  It is a handful of lines that just passes the docs from a  
DocIterator to DocSetHitCollector.
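
Stripped of the Solr types, the helper Mike describes is essentially the following (DocIterator and DocSetHitCollector replaced with a plain iterator and HashSet for illustration; the real code would collect into Solr's DocSet implementation instead):

```java
import java.util.*;

public class DocListToSet {
    // Plain-Java analogue of the custom getDocSetFromDocList() helper:
    // drain an ordered iterator of doc ids (the DocList) into an
    // unordered set of ids (the DocSet) that faceting can count against.
    static Set<Integer> docSetFromDocList(Iterator<Integer> docIterator) {
        Set<Integer> docSet = new HashSet<>();
        while (docIterator.hasNext()) {
            docSet.add(docIterator.next());
        }
        return docSet;
    }
}
```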

-Mike

Re: Faceting over limited result set

Posted by Pieter Berkel <pi...@gmail.com>.
On Nov 14, 2007 6:44 AM, Mike Klaas <mi...@gmail.com> wrote:
>
> An implementation might look like:
>
>          DocList superlist;
>          int facetDocLimit = params.getInt(DMP.FACET_DOCLIMIT, -1);
>          if(facetDocLimit > 0 && facetDocLimit != req.getLimit()) {
>            superlist = s.getDocList(query, restrictions,
>                                     SolrPluginUtils.getSort(req),
>                                     req.getStart(), facetDocLimit,
>                                     flags);
>            results.docSet = SearcherUtils.getDocSetFromDocList(superlist, s);
>            results.docList = superlist.subset(0, req.getLimit());
>          } else {
>
> Where getDocSetFromDocList() uses DocSetHitCollector to build a DocSet.
>
> To answer the performance question: There is a gain to be had when
> doing lots of faceting on huge indices, if N is low (say, 500-1000).
> One problem with the implementation above is that it stymies the
> query caching in SolrIndexSearcher (since the generated DocList is >
> the cache upper bound).
>
> -Mike

Thanks Mike, that looks like a good place to start.  While I really
can't think of any practical use for limiting the size of DocSet other
than simple faceting, the new search component architecture makes it a
little more difficult to confine any implementation to only the facet
component (unless there is an efficient way to obtain a subset of a
DocSet, which there doesn't seem to be).  I'm also aware of the query
caching issues arising from SolrIndexSearcher however if N is
sufficiently low this (hopefully) shouldn't be too much of a problem.

I can't find either the SearcherUtils class nor any reference to a
getDocSetFromDocList() method in svn trunk, is this deprecated or
custom-built code?

-Piete

Re: Faceting over limited result set

Posted by Mike Klaas <mi...@gmail.com>.
On 12-Nov-07, at 8:03 AM, Chris Hostetter wrote:

>
> if what you are interested in is stats on the first N docs according to a
> specific sort (score or otherwise) then you could write a custom request
> handler that executed a search with a limit of N, got the DocList,
> iterated over it to build a DocSet, and then used that DocSet to do
> faceting ... but that would probably take even longer than just using the
> full DocSet matching the entire query.

An implementation might look like:

         DocList superlist;
         int facetDocLimit = params.getInt(DMP.FACET_DOCLIMIT, -1);
         if(facetDocLimit > 0 && facetDocLimit != req.getLimit()) {
           superlist = s.getDocList(query, restrictions,
                                    SolrPluginUtils.getSort(req),
                                    req.getStart(), facetDocLimit,
                                    flags);
           results.docSet = SearcherUtils.getDocSetFromDocList(superlist, s);
           results.docList = superlist.subset(0, req.getLimit());
         } else {

Where getDocSetFromDocList() uses DocSetHitCollector to build a DocSet.

To answer the performance question: There is a gain to be had when  
doing lots of faceting on huge indices, if N is low (say, 500-1000).   
One problem with the implementation above is that it stymies the  
query caching in SolrIndexSearcher (since the generated DocList is >  
the cache upper bound).

-Mike

Re: Faceting over limited result set

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm trying to obtain faceting information based on the first 'x' (let's say
: 100-500) results matching a given (dismax) query.  The actual documents
: matching the query are not important in this case, so intuitively the

can you elaborate on your use case ... the only time i've ever seen people 
ask about something like this it was because true facet counts were too 
expensive to compute, so they were doing "sampling" of the first N 
results.

In Solr, sampling like this would likely be just as expensive as getting 
the full count.

: Unfortunately I can't find any easy way to limit the number of documents
: matched (and returned in the set).  It might be possible to achieve the

matching occurs in increasing order of docid, so even if there was a hook 
to say "stop matching after N docs" those N wouldn't be a good 
representative sample, they would be biased towards "older" documents 
(based on when they were indexed, not on any particular date field)

if what you are interested in is stats on the first N docs according to a 
specific sort (score or otherwise) then you could write a custom request 
handler that executed a search with a limit of N, got the DocList, 
iterated over it to build a DocSet, and then used that DocSet to do 
faceting ... but that would probably take even longer than just using the 
full DocSet matching the entire query.

but again: what is your use case?  the underlying question really baffles 
me.


-Hoss