Posted to solr-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2011/09/24 02:59:08 UTC

Getting facet counts for 10,000 most relevant hits

If relevance ranking is working well, in theory it doesn't matter how many hits you get as long as the best results show up in the first page of results.  However, by default the facet values shown are those with the highest counts over the entire result set.  Is there a way to issue some kind of filter query or facet query that would show facet counts for only the 10,000 most relevant search results?

As an example, if you search in our full-text collection for "jaguar" you get 170,000 hits.  If I am looking for the car rather than the OS or the animal, I might expect to be able to click on a facet and limit my results to the car.  However, facets containing the word car or automobile are not in the top 5 facets that we show.  If you click on "more" you will see "automobile periodicals" but not the rest of the facets containing the word automobile.  This occurs because the facet counts are for all 170,000 hits; counts from at least 160,000 irrelevant hits are included (assuming only the top 10,000 hits are relevant).

What we would like to do is get the facet counts for the N most relevant documents and select the 5 or 30 facet values with the highest counts for those relevant documents.
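
In client-side terms, the computation would look something like the sketch below (a rough illustration only; the "topic" field and the sample documents are made up, and in practice the top-N documents would come from a query with rows=N returning the facet field):

```python
from collections import Counter

def top_facets(top_n_docs, facet_field, num_values):
    """Count facet values over only the N most relevant docs and
    return the most common ones."""
    counts = Counter()
    for doc in top_n_docs:
        # a multi-valued facet field yields a list of values per document
        counts.update(doc.get(facet_field, []))
    return counts.most_common(num_values)

# Hypothetical top-ranked docs for the "jaguar" query
docs = [
    {"id": "1", "topic": ["automobiles"]},
    {"id": "2", "topic": ["automobiles", "periodicals"]},
    {"id": "3", "topic": ["animals"]},
]
print(top_facets(docs, "topic", 2))  # [('automobiles', 2), ('periodicals', 1)]
```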

Is this possible or would it require writing some lucene or Solr code?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

RE: Getting facet counts for 10,000 most relevant hits

Posted by Chris Hostetter <ho...@fucit.org>.
: I didn't realize how much more complicated this gets with distributed 
: search. Do you think it's worth opening a JIRA issue for this?

features are always worth opening jiras for if you have ideas related to 
those features to add as comments (or a patch)

by all means open a jira and put whatever relevant notes you think make 
sense (crib from my email as much as you want)

as i (think i) mentioned: the only feasible way i can think of to 
approach this type of problem in a generalized way at scale is to think 
about the API as a "sampling" API, where instead of specifying absolutes (ie: 
give me the top 100 constraints from the top 10,000 matches) the API works 
in terms of "goals" (ie: "suggest the top 100 constraints based on the top 10% of 
matches") and then solr has some wiggle room -- it can ask each shard for 
the 100*N constraints from the top (10*M)% matches, then weight all those 
constraints based on how many matches come from each shard to pick the 
final 100 constraints, then ask each shard for the final counts for those 
constraints (like it already does)

: Is there already some ongoing work on the faceting code that this might fit in with?

not that i know of.


-Hoss

RE: Getting facet counts for 10,000 most relevant hits

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Thanks so much for your reply Hoss,

I didn't realize how much more complicated this gets with distributed search. Do you think it's worth opening a JIRA issue for this?
Is there already some ongoing work on the faceting code that this might fit in with?

In the meantime, I think I'll go ahead and do some performance tests on my kludge.  That might work for us as an interim measure until I have time to dive into the Solr/Lucene distributed faceting code.

Tom

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Friday, September 30, 2011 9:20 PM
To: solr-user@lucene.apache.org
Subject: RE: Getting facet counts for 10,000 most relevant hits


: I figured out how to do this in a kludgey way on the client side but it 
: seems this could be implemented much more efficiently at the Solr/Lucene 
: level.  I described my kludge and posted a question about this to the 

It can, and I have -- but only for the case of a single node...

In general the faceting code in solr just needs a DocSet.  The default 
impl uses the DocSet computed as a side effect of executing the main 
search, but a custom SearchComponent could pick any DocSet it wants.

A few years back I wrote a custom faceting plugin that computed a "score" 
for each constraint based on:
 * Editorially assigned weights from a config file
 * the number of matching documents (ie: normal constraint count)
 * the number of matching documents from the first N results

...where the last number was determined by internally executing the search 
with "rows" of N to generate a DocList object, and then converting that 
DocList into a DocSet, and using that as the input to SimpleFacetCounts.

Ignoring the "Editorial weights" part of the above, the logic for 
"scoring" constraints based on the other two factors is general enough 
that it could be implemented in solr, we just need a way to configure "N" 
and what kind of function should be applied to the two counts.
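
conceptually, that two-count scoring looks something like this sketch (python, with an arbitrary combining function and made-up counts -- not the actual plugin code):

```python
def score_constraints(total_counts, top_n_counts, boost=20.0):
    """Score each facet constraint by its count over the whole result
    set plus a boosted count over only the first N results.  The
    combining function (linear, boost=20) is purely illustrative."""
    scores = {
        value: total + boost * top_n_counts.get(value, 0)
        for value, total in total_counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up counts for the "jaguar" example
total = {"animals": 90000, "automobiles": 20000, "periodicals": 15000}
top_n = {"automobiles": 6000, "periodicals": 2500, "animals": 500}
# "automobiles" now outranks "animals" despite its lower overall count
print(score_constraints(total, top_n))
```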

	...But...

This approach really breaks down in a distributed model.  You can't do the 
same quick and easy DocList->DocSet transformation on each node, you have 
to do more complicated federating logic like the existing FacetComponent 
code does, and even there we don't have anything that would help with the 
"only the first N" type logic.  My best idea would be to do the same thing 
you describe in your "kludge" approach to solving this in the client...

: (http://lucene.472066.n3.nabble.com/Solr-should-provide-an-option-to-show-only-most-relevant-facet-values-tc3374285.html).  

...the coordinator would have to query all of the shards for their top N, 
and then tell each one exactly which of those docs to include in the 
"weighted facets constraints" count ... which would make for some relaly 
big requests if N is large.

the only sane way to do this type of thing efficiently in a distributed 
setup would probably be to treat the "top N" part of the goal as a 
"guideline" for a sampling problem, telling each shard to consider only 
*their* top N results when computing the top facets in shardReq #1, and 
then do the same "give me an exact count" type logic in shardReq #2 
that we already do.  So the constraints picked may not actually be 
the top constraints for the first N docs across the whole collection (just 
like right now they aren't guaranteed to be the top constraints for all 
docs in the collection in a long tail situation), but they would be 
representative of the "first-ish" docs across the whole collection.

-Hoss

RE: Getting facet counts for 10,000 most relevant hits

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Lan,

I figured out how to do this in a kludgey way on the client side, but it seems this could be implemented much more efficiently at the Solr/Lucene level.  I described my kludge and posted a question about this to the dev list, but so far have not received any replies (http://lucene.472066.n3.nabble.com/Solr-should-provide-an-option-to-show-only-most-relevant-facet-values-tc3374285.html).  I also found SOLR-385, but I don't understand how grouping solves the problem.  It looks like a much different issue to me.

The problem I am trying to solve is that I only have room in the interface to show 30 facet values at most.  Whether those are ordered by facet counts against the entire result set or by the highest ranking score of a member of a facet-value group, we want to base the facet counts/ranking on only the top N hits rather than the entire result set -- in my use case, the top 10,000 hits versus all 170,000.

Tom

-----Original Message-----
From: Lan [mailto:dung.lan@gmail.com] 
Sent: Thursday, September 29, 2011 7:40 PM
To: solr-user@lucene.apache.org
Subject: Re: Getting facet counts for 10,000 most relevant hits

I implemented a similar feature for a categorization suggestion service. I
did the faceting in the client code, which is not exactly the best
performing but it worked very well.

It would be nice to have the Solr server do the faceting for performance.


Burton-West, Tom wrote:
> 
> If relevance ranking is working well, in theory it doesn't matter how many
> hits you get as long as the best results show up in the first page of
> results. [...]
> 
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search


--
View this message in context: http://lucene.472066.n3.nabble.com/Getting-facet-counts-for-10-000-most-relevant-hits-tp3363459p3380852.html
Sent from the Solr - User mailing list archive at Nabble.com.
