Posted to solr-user@lucene.apache.org by Jan Høydahl <ja...@cominvent.com> on 2020/02/13 14:50:14 UTC

Re: [External] wildcards match end-of-word?

Be aware that if you search a field with stemming, the index will only contain the stems, e.g. «cars» and «caring» may both be indexed as «car». When you do a wildcard search, stemming is skipped for the wildcard term (Solr applies at most a few simple filters such as lowercasing), so you are only targeting the exact tokens that happen to be in that field. Thus a search for «ca*s» or «c*ing» or «cars*» will not match, but «car*» and even «c*r» will match both these words, which would be surprising, right?

So if wildcard search is a key feature, you had better provide a copyField with a fieldType in your schema that does not do stemming, probably only StandardTokenizer and LowerCaseFilter. Then use that field for your wildcard queries instead of the generic stemmed field.
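
A toy sketch of the effect, in Python rather than Solr, in case it helps to see it run. Everything here is an assumption made up for illustration: toy_stem is a hypothetical stand-in for a real stemmer, the two sets stand in for the tokens of a stemmed field and an unstemmed copy of it, and fnmatch only approximates Solr's wildcard syntax.

```python
from fnmatch import fnmatchcase


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def wildcard_hits(pattern: str, index_tokens: set) -> list:
    """Match the raw pattern against indexed tokens; the pattern is not stemmed."""
    return sorted(t for t in index_tokens if fnmatchcase(t, pattern))


words = ["cars", "caring"]

# Tokens as they would end up in a stemmed field: both words collapse to one stem.
stemmed_index = {toy_stem(w.lower()) for w in words}

# Tokens as they would end up in a lowercase-only copy of the field.
plain_index = {w.lower() for w in words}

for pattern in ("ca*s", "c*ing", "cars*", "car*", "c*r"):
    print(pattern, wildcard_hits(pattern, stemmed_index), wildcard_hits(pattern, plain_index))
```

With this toy stemmer, «ca*s», «c*ing» and «cars*» find nothing in the stemmed set but do find the original words in the plain set, while «car*» matches in both.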

Jan

> 13. feb. 2020 kl. 13:52 skrev Fischer, Stephen <sf...@pennmedicine.upenn.edu>:
> 
> Folks,
> 
> I am seeing very strange (bad) wildcard behavior (solr 8).  
> 
> "kinase" finds hits as expected.  
> 
> "kin*ase" and "kin*se" find 0 results.  "kinase*" matches only values like "kinase," and "kinase-" but not "kinase".
> 
> I have done the analysis as Erick suggested (thanks!) but it is not helping me understand why we'd have this problem.
> 
> I have put together 12 screenshots from the Solr web UI that show in detail:
> - the queries I ran to get the results above
> - various analyses trying to understand why
> - the schema for the fieldType in question
> 
> https://docs.google.com/presentation/d/10fIAesqkTnvmJBFaerEhnqWhSiaEvVW7u9jE1nX564Q/edit?usp=sharing
> 
> thanks,
> steve
> 
> -----Original Message-----
> From: Sotiris Fragkiskos <sf...@gmail.com> 
> Sent: Thursday, February 13, 2020 4:03 AM
> To: solr-user@lucene.apache.org
> Subject: [External] Re: wildcards match end-of-word?
> 
> Hi Erick,
> thanks very much for this information, it was immensely useful, I always had the same question!
> I'm now seeing the Analysis page and finally I don't have to rely on an external online stemmer to see what solr *probably* stemmed the term to!!
> But I still can't make the asterisk and question mark work inside the term, even in the earlier parts of it.
> e.g. tr?ining
> I would expect it to match train. But it doesn't.
> PSF at the end just shows t | ain
> every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF)
> Am I doing something very wrong??
> 
> thanks again!
> Sotiri
> 
> On Wed, Feb 12, 2020 at 1:44 PM Erick Erickson <er...@gmail.com> wrote:
> 
>> Steve:
>> 
>> You _really_ want to get acquainted with the admin UI/Analysis page ;).
>> Choose a core/collection and you should see the choice. It shows you 
>> exactly what transformations your data goes through. If you hover over 
>> the light gray pairs of letters, you’ll get a tooltip showing you what 
>> part of your analysis chain is responsible for a particular change. I 
>> un-check the “verbose” box 95% of the time BTW.
>> 
>> The critical bit is that what comes out of the end of the analysis 
>> pipe are the tokens that are actually _in_ the index. From there, 
>> problems like this make more sense.
>> 
>> My bet is that, as Walter says, you have a stemmer in the analysis 
>> chain and the actual token in the index is “kinas” so of course 
>> “kinase*” won’t be found. By adding OR kinase to the query, that token 
>> is stemmed to “kinas” and matches.
>> 
>> Also, adding &debug=query to your URL will show you what the query 
>> looks like after parsing and analysis, also a major tool for figuring 
>> out what’s really happening.
>> 
>> Wildcards are not stemmed, which can lead to surprising results. 
>> There’s no perfect answer here. Let’s claim wildcards _were_ stemmed. 
>> Then you’d have to try to explain why “running*” returned a doc with 
>> only “run” or “runner” or “runs” or... in it, but searching for 
>> “runnin*” did not, due to the stemmer not recognizing it as a stemmable word.
>> 
>> Finally, one of my personal hot buttons is wildcards in general. 
>> They’re very often over-used because people are used to simple search capabilities.
>> Something about “if your only tool is a hammer, every problem looks 
>> like a nail”. That gets into training users too though...
>> 
>> Best,
>> Erick
>> 
>>> On Feb 11, 2020, at 9:24 PM, Fischer, Stephen <sfischer@pennmedicine.upenn.edu> wrote:
>>> 
>>> Hi,
>>> 
>>> I am a solr newbie.  I was surprised to discover that a search for kinase* returned fewer results than kinase.
>>> 
>>> Then I read the wildcard documentation <https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-WildcardSearches>, and saw why.  kinase* will not match the word "kinase".
>>> 
>>> Our end-users won't expect this behavior.  Presumably the solution would be for them (actually us, on their behalf) to use kinase* OR kinase.
>>> 
>>> But that is kind of a hack.
>>> 
>>> Is there a way we can configure solr to have wildcards match on end-of-word?
>>> 
>>> Thanks,
>>> Steve