Posted to solr-user@lucene.apache.org by David '-1' Schmid <gd...@gmail.com> on 2019/02/19 07:58:19 UTC

How to suggest prefix matches over all tokens of a field (was Re: Suggest Component, prefix match (sur-)name)

On 2019-02-18T18:12:44, David '-1' Schmid wrote:
> Will report back if that's working out.
It's working!

If anybody wants to replicate this, here's what I ended up with.

.. managed-schema:
. 
. <!-- the field(Type) where the original values are stored -->
. <fieldType name="important_strings" class="solr.StrField"
.   sortMissingLast="true" docValues="true" indexed="true" stored="true"
.   multiValued="true"/>
. <field name="author" type="important_strings"/>
.
. <!-- lowercase tokenization for case-insensitive matches -->
. <fieldType name="text_lower" class="solr.TextField" multiValued="true" positionIncrementGap="100">
.   <analyzer>
.     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
.     <filter class="solr.LowerCaseFilterFactory"/>
.   </analyzer>
. </fieldType>
. <field name="author_lower" type="text_lower"/>
. <copyField source="author" dest="author_lower"/>
.
. <!-- as above but with added edgeNGrams -->
. <fieldType name="text_prefix" class="solr.TextField" multiValued="true" positionIncrementGap="100">
.   <analyzer type="index">
.     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
.     <filter class="solr.LowerCaseFilterFactory"/>
.     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
.   </analyzer>
.   <analyzer type="query">
.     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
.     <filter class="solr.LowerCaseFilterFactory"/>
.   </analyzer>
. </fieldType>
. <field name="author_ngram" type="text_prefix"/>
. <copyField source="author" dest="author_ngram"/>
.

The requestHandler uses the three fields above to provide suggestions.

.. solrconfig.xml:
. 
. <requestHandler class="solr.SearchHandler" name="/suggest_author">
.   <lst name="defaults">
.     <str name="defType">edismax</str>
.     <str name="rows">10</str>
.     <str name="fl">author</str>
.     <str name="qf">author_lower^10 author_ngram</str>
.   </lst>
. </requestHandler>
.

If a query token matches a complete name (first name or surname) of an
author, the ^10 boost on author_lower ranks that complete match above a
partial match from author_ngram:

Let's say I want to find "Hauck" and have typed only the first four chars.

.. curl 'http://localhost:8983/solr/dblp/suggest_author?q=hauc'
. "docs": [
.     {
.         "author": [
.             "Gregor Hauc"
.         ]
.     },
.     {
.         "author": [
.             "Andrej Kovacic",
.             "Gregor Hauc",
.             "Brina Buh",
.             "Mojca Indihar Stemberger"
.         ]
.     },
.     {
.         "author": [
.             "Franz J. Hauck",
.             "Franz Johannes Hauck"
.         ]
.     },
.  /* ... */
. ]

Once I type the last character, complete matches are boosted over
partial matches:

.. curl 'http://localhost:8983/solr/dblp/suggest_author?q=hauck'
. "docs": [
.     {
.         "author": [
.             "Rainer Hauck"
.         ]
.     },
.     {
.         "author": [
.             "Julia Hauck"
.         ]
.     },
.     {
.         "author": [
.             "Bernd Hauck"
.         ]
.     },
.  /* ... */
. ]

As these are not the people I was looking for, I start typing the
first name:

.. curl 'http://localhost:8983/solr/dblp/suggest_author?q=hauck%20fra'
. "docs": [
.     {
.         "author": [
.             "Fra Angelico Viray"
.         ]
.     },
.     {
.         "author": [
.             "Alberto Del Fra"
.         ]
.     },
.     {
.         "author": [
.             "Alberto Del Fra"
.         ]
.     },
.  /* ... */
. ]

Oh no, now my previous match was replaced by other matches.
This can be circumvented by adding "q.op=AND" to require both terms:

.. curl 'http://localhost:8983/solr/dblp/suggest_author?q.op=AND&q=hauck%20fra'
. "docs": [
.     {
.         "author": [
.             "Franz J. Hauck",
.             "Franz Johannes Hauck"
.         ]
.     },
.  /* ... */
. ]

Which achieves what I wanted, really.
q.op can be set in solrconfig.xml to always use AND.
Adding hl=true to the query will provide highlighting:

.. curl 'http://localhost:8983/solr/dblp/suggest_author?q.op=AND&q=hauck%20fra&hl=true'
. "highlighting": {
.     "homepages/h/FranzJHauck": {
.         "author_lower": [
.             "Franz J. <em>Hauck</em>"
.         ],
.         "author_ngram": [
.             "<em>Franz</em> J. <em>Hauck</em>"
.         ]
.     },
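
For the solrconfig route mentioned above, a minimal and untested sketch of
what the handler defaults could look like. It only moves the q.op and hl
request parameters from the URL into the defaults of the /suggest_author
handler; enabling hl by default is an assumption, drop that line if
highlighting should stay opt-in. Parameters in "defaults" can still be
overridden per request.

.. solrconfig.xml:
.
. <requestHandler class="solr.SearchHandler" name="/suggest_author">
.   <lst name="defaults">
.     <str name="defType">edismax</str>
.     <str name="rows">10</str>
.     <str name="fl">author</str>
.     <str name="qf">author_lower^10 author_ngram</str>
.     <!-- require every query term to match, same as q.op=AND in the URL -->
.     <str name="q.op">AND</str>
.     <!-- assumption: highlight by default, same as hl=true in the URL -->
.     <str name="hl">true</str>
.   </lst>
. </requestHandler>
.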

I'm pretty happy with this :D

The original idea came from the book "Solr in Action" by Trey Grainger
and Timothy Potter. It's from 2014 (builds on Solr 4.7), and might need some
adaptations :D

regards,
-1