Posted to solr-user@lucene.apache.org by David '-1' Schmid <da...@uni-ulm.de> on 2019/02/15 15:14:13 UTC
Suggest Component, prefix match (sur-)name
Hello solr-users!
I'm a bit stumped and after some days of trial-and-error, I've come to
the conclusion that I cannot figure this out by myself.
Where I'm at:
Solr 7.7 in cloud mode:
- 3 shards,
- replication factor 1,
- 1 shard per node,
- 3 nodes,
- coordinated with external zookeeper
- running on three different VMs
What I do:
I'm building a search backend for academic citations, and the authors
are among the most important fields. They are stored as
.. managed-schema:
.
. <fieldType name="important_strings" class="solr.StrField"
. sortMissingLast="true" docValues="true" indexed="true" stored="true"
. multiValued="true"/>
.
. <field name="author" type="important_strings"/>
.
and a random sample from the relevant data:
"author":["Stefan Diepenbrock", "Timo Ropinski", "Klaus H. Hinrichs"],
What I'd like to achieve:
I'd like to provide (auto-complete) suggestions based on the names.
Starting with the easy case:
----------------------------
Someone sends a query for
'diepen'
I'd want to match case-insensitive on all authors having 'diepen' as
prefix in their (sur-)names.
In this example, matching
'Stefan [Diepen]brock'
I got this working by defining a new field type for the suggester
.. managed-schema:
.
. <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="1000">
. <analyzer type="index">
. <tokenizer class="solr.LowerCaseTokenizerFactory"/>
. <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
. </analyzer>
. <analyzer type="query">
. <tokenizer class="solr.LowerCaseTokenizerFactory"/>
. </analyzer>
. </fieldType>
.
and using that in the searchComponent
.. solrconfig.xml:
.
. <searchComponent name="authorsuggest" class="solr.SuggestComponent">
. <lst name="suggester">
. <str name="name">default</str>
. <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
. <str name="dictionaryImpl">DocumentDictionaryFactory</str>
. <str name="field">author</str>
. <str name="allTermsRequired">true</str>
. <str name="highlight">true</str>
. <str name="minPrefixChars">4</str>
. <str name="suggestAnalyzerFieldType">text_prefix</str>
. <str name="buildOnStartup">false</str>
. </lst>
. </searchComponent>
.
. <requestHandler name="/authors" class="solr.SearchHandler" startup="lazy">
. <lst name="defaults">
. <str name="suggest">true</str>
. <str name="suggest.count">10</str>
. </lst>
. <arr name="components">
. <str>authorsuggest</str>
. </arr>
. </requestHandler>
.
After building with
curl 'http://localhost:8983/solr/dblp/authors?suggest.build=true'
this will yield something along the lines of
.. curl 'http://localhost:8983/solr/dblp/authors?suggest.q=Diepen'
. {
. "suggest":{"default":{
. "Diepen":{
. "numFound":10,
. "suggestions":[{
. "term":"M. Diepenhorst",
. "weight":0,
. "payload":""},
. {
. "term":"Sjoerd Diepen",
. "weight":0,
. "payload":""},
. {
. "term":"Stefan Diepenbrock",
. "weight":0,
. "payload":""},
. {
. /* abbreviated */
.
This might all have worked out by accident.
So if you see something weird: this is what I ended up with after
running against this wall, trying out different things.
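To make clear what I think the index-time analyzer of 'text_prefix' emits,
here is a small Python sketch. It is plain Python, not Solr code, and only
approximates the analysis chain: the function name edge_ngrams is made up,
and I'm assuming that LowerCaseTokenizer splits on non-letters and that
EdgeNGramFilter drops tokens shorter than minGramSize.

```python
import re

def edge_ngrams(text, min_gram=3, max_gram=15):
    """Approximate LowerCaseTokenizer + front-side EdgeNGramFilter (hypothetical helper)."""
    # Assumed tokenizer behaviour: lowercase, then split on anything that is not a letter.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    grams = []
    for token in tokens:
        # Assumed filter behaviour: tokens shorter than min_gram produce no grams.
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:size])
    return grams

# Front prefixes of length 3..15 for each name part.
print(edge_ngrams("Stefan Diepenbrock"))
```

If that model is right, 'diepen' is one of the indexed grams for
'Stefan Diepenbrock', which would explain why the single-prefix case works.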
Now the tricky part:
--------------------
If someone were to type two prefixes of an author's name:
'Stef Diep' or 'Diep Stef'
I want to match these whitespace-separated prefixes against all names of
the author and deliver the results where *both* prefixes match before the
others.
With the current setup, however, curl yields:
.. curl 'http://localhost:8983/solr/dblp/authors?suggest.q=Stef%20Diep'
. {
. "suggest":{"default":{
. "Stef Diepen":{
. "numFound":10,
. "suggestions":[{
. "term":"J. Gregory Steffan",
. "weight":0,
. "payload":""},
. {
. "term":"Stefano Spaccapietra",
. "weight":0,
. "payload":""},
. {
. /* abbreviated */
.
even when providing the full name as "suggest.q=Stefan%20Diepenbrock".
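To pin down the ordering I'm after, here is a small Python sketch of the
desired behaviour. Again, this is not Solr code and not how any suggester
is implemented; the helper names (tokens, count_prefix_hits, rank) are made
up for illustration, and the sample names are taken from the outputs above.

```python
import re

def tokens(text):
    """Lowercase and split on non-letters (hypothetical helper)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def count_prefix_hits(query, name):
    """Count how many query prefixes match the start of some name part."""
    name_tokens = tokens(name)
    return sum(
        any(part.startswith(prefix) for part in name_tokens)
        for prefix in tokens(query)
    )

def rank(query, names):
    """Keep names with at least one hit; names matching more prefixes come first."""
    scored = [(count_prefix_hits(query, name), name) for name in names]
    hits = [pair for pair in scored if pair[0] > 0]
    # Stable sort: ties keep their original relative order.
    hits.sort(key=lambda pair: -pair[0])
    return [name for _, name in hits]

authors = [
    "J. Gregory Steffan",
    "Stefano Spaccapietra",
    "Stefan Diepenbrock",
    "Sjoerd Diepen",
]
print(rank("Stef Diep", authors))
print(rank("Diep Stef", authors))
```

In both calls I would expect 'Stefan Diepenbrock' to come out on top, since
it is the only name where both prefixes match, regardless of their order.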
Other stuff that's weird:
- I'm getting duplicates, like ten times the same name
- Suggester results are non-deterministic
These are not as important, and I guess they are due to running in
cloud mode.
I've tried:
- reading
  - through some of the Lucene JavaDocs, since the
    solr-ref-guide is a bit sparse on information about the variables,
  - the ref-guide, over and over,
  - many blogs based on old Solr versions (ab)using spellcheck for
    suggestions,
  - and several other pages I found.
- other combinations of analyzers, tokenizers and filters
- other Dict and Lookup implementations (the wrong ones?)
but no such luck.
I hope I did not leave anything relevant out.
regards,
-1