Posted to solr-user@lucene.apache.org by David '-1' Schmid <da...@uni-ulm.de> on 2019/02/15 15:14:13 UTC
Suggest Component, prefix match (sur-)name

Hello solr-users!

I'm a bit stumped and after some days of trial-and-error, I've come to
the conclusion that I cannot figure this out by myself.

Where I'm at:

Solr 7.7 in cloud mode:
- 3 shards,
- replication factor 1,
- 1 shard per node,
- 3 nodes,
  - coordinated with external zookeeper
  - running on three different VMs

What I do:

I'm building a search backend for academic citations. One of the most
important pieces of data is the author list. Authors are stored as

.. managed-schema:
. 
. <fieldType name="important_strings" class="solr.StrField"
.   sortMissingLast="true" docValues="true" indexed="true" stored="true"
.   multiValued="true"/>
. 
. <field name="author"    type="important_strings"/>
.

and a random sample from the relevant data:

 "author":["Stefan Diepenbrock", "Timo Ropinski", "Klaus H. Hinrichs"],


What I'd like to achieve:

I'd like to provide (auto-complete) suggestions based on the names.


Starting with the easy case:
----------------------------
Someone sends a query for
  'diepen'

I'd want to match case-insensitively on all authors having 'diepen' as
a prefix in their (sur-)names.
In this example, matching
  'Stefan [Diepen]brock'

I got this working by defining a new field type for the suggester

.. managed-schema:
.
. <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="1000">
.   <analyzer type="index">
.     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
.     <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
.   </analyzer>
.   <analyzer type="query">
.     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
.   </analyzer>
. </fieldType>
.

and using that in the searchComponent

.. solrconfig.xml:
.
. <searchComponent name="authorsuggest" class="solr.SuggestComponent">
.   <lst name="suggester">
.     <str name="name">default</str>
.     <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
.     <str name="dictionaryImpl">DocumentDictionaryFactory</str>
.     <str name="field">author</str>
.     <str name="allTermsRequired">true</str>
.     <str name="highlight">true</str>
.     <str name="minPrefixChars">4</str>
.     <str name="suggestAnalyzerFieldType">text_prefix</str>
.     <str name="buildOnStartup">false</str>
.   </lst>
. </searchComponent>
.
. <requestHandler name="/authors" class="solr.SearchHandler" startup="lazy">
.   <lst name="defaults">
.     <str name="suggest">true</str>
.     <str name="suggest.count">10</str>
.   </lst>
.   <arr name="components">
.     <str>authorsuggest</str>
.   </arr>
. </requestHandler>
.
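
To illustrate what I think the index analyzer of text_prefix does, here
is a rough Python sketch. It is only my approximation of the chain and
not Solr's actual code: the function names are made up, and I'm assuming
the tokenizer splits at non-letters and that tokens shorter than
minGramSize produce no grams.

```python
import re


def lowercase_tokenize(text):
    """Approximate LowerCaseTokenizer: split at non-letters, then lowercase."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def edge_ngrams(token, min_gram=3, max_gram=15):
    """Front edge n-grams of one token (assumed: short tokens yield nothing)."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(name):
    """All terms I expect to be indexed for a single author name."""
    return [gram
            for token in lowercase_tokenize(name)
            for gram in edge_ngrams(token)]


if __name__ == "__main__":
    print(index_terms("Stefan Diepenbrock"))
```

If this model is right, 'diepen' is one of the indexed terms for
"Stefan Diepenbrock", which would explain why the single-prefix case
works.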

After building with
  curl 'http://localhost:8983/solr/dblp/authors?suggest.build=true'
this will yield something along the lines of

.. curl 'http://localhost:8983/solr/dblp/authors?suggest.q=Diepen'           
. {
.   "suggest":{"default":{
.       "Diepen":{
.         "numFound":10,
.         "suggestions":[{
.             "term":"M. Diepenhorst",
.             "weight":0,
.             "payload":""},
.           {
.             "term":"Sjoerd Diepen",
.             "weight":0,
.             "payload":""},
.           {
.             "term":"Stefan Diepenbrock",
.             "weight":0,
.             "payload":""},
.           {
.            /* abbreviated */
.

This might all have worked out by accident.
So if you see something weird: this is what I ended up with after
running into this wall and trying out different things.


Now the tricky part:
--------------------

If someone were to type two prefixes of an author's name:

  'Stef Diep' or 'Diep Stef'

I want to match these whitespace-separated prefixes against all names of
the author and deliver the results where *both* prefixes match before
the others.
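
To make the ordering I'm after concrete, here is a small Python sketch
of the behaviour I'd like. It is not a Solr configuration, just a
reference for the expected result; the function names are hypothetical
and the sample names are taken from this mail.

```python
import re


def name_tokens(text):
    """Split at non-letters and lowercase, like the analyzer is meant to."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def matched_prefixes(query, name):
    """Count how many query prefixes match the start of some name token."""
    tokens = name_tokens(name)
    return sum(1 for prefix in name_tokens(query)
               if any(token.startswith(prefix) for token in tokens))


def rank(query, names):
    """Names matching more prefixes come first; non-matching names are dropped."""
    scored = [(matched_prefixes(query, name), name) for name in names]
    hits = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so names with equal counts keep their input order
    return [name for _, name in sorted(hits, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    authors = ["J. Gregory Steffan", "Stefano Spaccapietra",
               "Stefan Diepenbrock", "Timo Ropinski"]
    print(rank("Stef Diep", authors))
    print(rank("Diep Stef", authors))
```

For both 'Stef Diep' and 'Diep Stef' I would want "Stefan Diepenbrock"
first, followed by the names that match only one of the prefixes.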

With the configuration above, however, curl yields:

.. curl 'http://localhost:8983/solr/dblp/authors?suggest.q=Stef%20Diep'
. {
.   "suggest":{"default":{
.       "Stef Diepen":{
.         "numFound":10,
.         "suggestions":[{
.             "term":"J. Gregory Steffan",
.             "weight":0,
.             "payload":""},
.           {
.             "term":"Stefano Spaccapietra",
.             "weight":0,
.             "payload":""},
.           {
.            /* abbreviated */
.

This happens even when providing the full name as
"suggest.q=Stefan%20Diepenbrock".

Other stuff that's weird:
- I'm getting duplicates, like ten times the same name
- Suggester results are non-deterministic

These are not as important, and I guess they are due to running in
cloud mode.

I've tried:
- reading
  - through some of the lucene JavaDocs, since the
    solr-ref-guide is a bit sparse on information about the variables.
  - the ref-guide, over and over
  - many blogs based on old Solr versions (ab)using spellcheck for
    suggestions,
  - and several other pages I found.
- other combinations of analyzers, tokenizers and filters
- other Dict and Lookup Implementations (the wrong ones?)

but no such luck.

I hope I did not leave anything relevant out.

regards,
-1