Posted to solr-user@lucene.apache.org by Bob Sandiford <bo...@sirsidynix.com> on 2011/04/29 21:04:57 UTC

Determining Facet Values that match the Search Term(s) - suggestions?

Hi, all.

We've indexed various types of documents.  One of the fields we have is Author, and we already use it as a facet: the user can choose one of its values and narrow the results by it.

Now we've been given a use case that runs something like this:

1)      Choose 'Author Alphabetical' search, and enter a search term(s), for example 'Steel'

2)      Have a list of the Authors matching 'Steel' come up, each with a count of the associated documents

3)      User chooses one of those entries and then gets the document results where that Author is present.

So - it's 'sideways' in that we essentially want to present the facets first, with no results, let the user choose a facet, and then show the results.  And - the facets we show have to match the search term(s) entered (in some fashion - based on our Analysis chain or on fuzzy search).

So - I know how to get a list of all the Author Facet values for documents where 'Steel' matches in the Author field.  The problem is - Author is a multi-valued field, so the facet list returns not only the values that match on 'Steel', but also every other Author value on those documents.

I've come up with a really ugly approach that should work most of the time, but I'm hoping someone has a better idea here...

I've read through the Facet Parameters, and searched various other places, but haven't come across anything like this...  (I can't use the facet.prefix because I'm not looking for facets that begin with the search term(s), I'm looking for facets that contain the search term(s) - they could show up anywhere, and with the fuzzy handling, may not be exact matches anyways...)

Suggestions?



============================================================
For those masochists who want to know the approach I've come up with:

Search is something like this:

http://localhost:8983/solr/SD_ILS/select/?start=0&rows=5000&fl=JUNK&qt=standard&q=AUTHOR_boost:"mark twain"~30&facet=true&facet.mincount=1&facet.sort=index&facet.limit=-1&hl=true&hl.fl=AUTHOR_boost&hl.mergeContiguous=true&hl.snippets=5&facet.field=AUTHOR_facet&sort=id asc

Explanation:

1)      Searching for documents with the AUTHOR_boost field (our internal 'Author' field) with search term "mark twain" with a proximity distance of "30" (somewhat arbitrary).

2)      Return facet values for the AUTHOR_facet field with at least one document (AUTHOR_facet has the same original content as AUTHOR_boost - it is just 'string' instead of 'text' to bypass analysis)

3)      Return up to 5000 hits (this is one of the really kludgy bits), hoping that's enough to span all the hits that would include "mark twain".  However, specify the "fl" (fields to return) as a field that never exists, so we only get back empty <doc /> elements in the XML.

4)      Also do highlighting on the AUTHOR_boost field which tells us what value(s) the search terms were found in

5)      Sort by the document id - just as a kind of random sort to try to get as many distinct highlighting results as possible (i.e. don't want any score type sequencing which would cluster the highlight values)
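
As a rough illustration of steps 1 to 5, here is a minimal Python sketch that assembles the same request with proper URL encoding. The function name build_author_facet_url and its arguments are made up for this example, and it assumes the search terms contain no double quotes. The highlighting option is written as hl.mergeContiguous, the spelling Solr's highlighting parameters use.

```python
from urllib.parse import urlencode


def build_author_facet_url(select_url, terms, slop=30, rows=5000):
    """Build the 'facets first' request (hypothetical helper, not a Solr API)."""
    params = [
        ("start", 0),
        ("rows", rows),                    # step 3: large window of hits
        ("fl", "JUNK"),                    # step 3: nonexistent field, so docs come back empty
        ("qt", "standard"),
        ("q", f'AUTHOR_boost:"{terms}"~{slop}'),  # step 1: proximity search
        ("facet", "true"),
        ("facet.field", "AUTHOR_facet"),   # step 2: facet on the 'string' copy
        ("facet.mincount", 1),
        ("facet.sort", "index"),
        ("facet.limit", -1),
        ("hl", "true"),                    # step 4: highlight the analyzed field
        ("hl.fl", "AUTHOR_boost"),
        ("hl.mergeContiguous", "true"),
        ("hl.snippets", 5),
        ("sort", "id asc"),                # step 5: avoid score ordering
    ]
    # urlencode percent-encodes the quotes and colon and turns spaces into '+'.
    return select_url + "?" + urlencode(params)


url = build_author_facet_url("http://localhost:8983/solr/SD_ILS/select/", "mark twain")
```

Encoding the parameters this way avoids sending raw spaces and quotes in the query string, which some HTTP clients reject.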

Do some post processing:

6)      Build a set of Strings from the highlighting results - removing the highlight <em> and </em> elements.  Intent is that this becomes the set of 'mark twain' type Strings.

7)      Chug through the facet_field list for AUTHOR_facet and preserve only those which have an entry in the set of strings built from the highlighting results.

8)      Present that result back to the users along with the counts from the facet...

Really ugly.  But - will usually work...
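
The post-processing in steps 6 to 8 could look roughly like the Python sketch below. It assumes the whole XML response is available as a string; the function name matching_author_facets is invented for the example. It only finds a facet value when a highlight snippet reproduces that value in full, so a snippet cut short by the highlighter's fragment size would be missed.

```python
import re
import xml.etree.ElementTree as ET

EM_TAG = re.compile(r"</?em>")


def matching_author_facets(response_xml):
    """Return (author, count) pairs whose value appears in the highlighting."""
    root = ET.fromstring(response_xml)

    # Step 6: collect highlighted author strings with the <em> markers removed.
    highlighted = set()
    for snippet in root.findall(
        "./lst[@name='highlighting']/lst/arr[@name='AUTHOR_boost']/str"
    ):
        # itertext() handles <em> parsed as child elements; the regex handles
        # <em> delivered as escaped text inside the <str>.
        text = "".join(snippet.itertext())
        highlighted.add(EM_TAG.sub("", text))

    # Step 7: keep only the facet values that also appear in that set.
    matches = []
    for facet in root.findall(
        "./lst[@name='facet_counts']/lst[@name='facet_fields']"
        "/lst[@name='AUTHOR_facet']/int"
    ):
        if facet.get("name") in highlighted:
            matches.append((facet.get("name"), int(facet.text)))

    # Step 8: the caller presents these values and counts to the user.
    return matches
```

The facet order from the response (index order, because of facet.sort=index) is preserved in the returned list.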

To help visualize this, here are some excerpts of the response:

<response>
  <result name="response" numFound="513" start="0">
    <doc />
    <doc />
    <doc />
    ...
    <doc />
  </result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="AUTHOR_facet">
        <int name="Adams, Joseph.">1</int>
        <int name="Addy, Wesley.">1</int>
        <int name="Albee, Josh.">1</int>
        <int name="Aldana, Raul.">1</int>
        ...
        <int name="Kern, Jerome, 1885-1945. Mark Twain suite.">1</int>
        ...
        <int name="Mark Twain Media.">1</int>
        ...
        <int name="Twain, Mark, 1835-1910">212</int>
        <int name="Twain, Mark, 1835-1910, Contributor">3</int>
        <int name="Twain, Mark, 1835-1910.">244</int>
        ...
      </lst>
    </lst>
  </lst>
  <lst name="highlighting">
    <lst name="ent://SD_ILS/0/SD_ILS:331">
      <arr name="AUTHOR_boost">
        <str><em>Twain</em>, <em>Mark</em>, 1835-1910</str>
      </arr>
    </lst>
    <lst name="ent://SD_ILS/0/SD_ILS:356">
      <arr name="AUTHOR_boost">
        <str><em>Twain</em>, <em>Mark</em>, 1835-1910</str>
      </arr>
    </lst>
    ...
    <lst name="ent://SD_ILS/104/SD_ILS:104542">
      <arr name="AUTHOR_boost">
        <str><em>Twain</em>, <em>Mark</em>, 1835-1910.</str>
      </arr>
    </lst>
    <lst name="ent://SD_ILS/11/SD_ILS:11485">
      <arr name="AUTHOR_boost">
        <str><em>Twain</em>, <em>Mark</em>, 1835-1910, Contributor</str>
      </arr>
    </lst>
    ...
    <lst name="ent://SD_ILS/482/SD_ILS:482038">
      <arr name="AUTHOR_boost">
        <str>Kern, Jerome, 1885-1945. <em>Mark</em> <em>Twain</em> suite.</str>
      </arr>
    </lst>
    ...
  </lst>
</response>


Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
www.sirsidynix.com