Posted to solr-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2017/06/01 01:46:19 UTC

Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

Your searches against the ascii_ignorecase_string field will suffer
performance-wise. SQL-like %whatever% queries essentially have to do a
table scan and assemble (conceptually) a huge OR clause consisting of
all the terms (in this case, strings) that match.
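
To make the cost concrete, here is a rough Python sketch of that
conceptual expansion. It is only an illustration, not how Lucene is
implemented internally; the function name and the sample terms are
made up for the example.

```python
def expand_substring_query(indexed_terms, needle):
    """Return every indexed term that contains `needle`.

    There is no shortcut here: each term has to be inspected,
    which is the "table scan" part of the cost.
    """
    needle = needle.lower()
    return [term for term in indexed_terms if needle in term]


# Hypothetical terms as a keyword-tokenized, lowercased field would hold them.
indexed_terms = ["sample-001", "blood sample a", "kowalski", "nowak"]

matching = expand_substring_query(indexed_terms, "sampl")

# Conceptually the query then becomes an OR over all matching terms.
or_clause = " OR ".join('"%s"' % term for term in matching)
```

With millions of distinct strings in the field, both the scan and the
resulting OR clause grow accordingly.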

Using NGrams, as Shawn suggested, is the way this is usually done.
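
As a minimal sketch of why NGrams help, assuming index-time grams of
2 to 4 characters (the sizes, function names and sample values are
assumptions for illustration, not recommendations): every substring in
that size range becomes its own term, so a substring search turns into
a single term lookup, at the price of a much larger index.

```python
from collections import defaultdict


def ngrams(value, min_size=2, max_size=4):
    """Return all substrings of `value` with length min_size..max_size."""
    grams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(value) - size + 1):
            grams.add(value[start:start + size])
    return grams


def build_index(values, min_size=2, max_size=4):
    """Map each gram to the ids of the values that contain it."""
    index = defaultdict(set)
    for doc_id, value in enumerate(values):
        for gram in ngrams(value.lower(), min_size, max_size):
            index[gram].add(doc_id)
    return index


values = ["Sample-001", "Kowalski", "Nowak"]
index = build_index(values)

# A substring search is now one dictionary lookup, no scan over all terms.
hits = index.get("owa", set())
```

Note that in this sketch a query longer than the largest gram size finds
nothing unless the query is broken into grams as well, which is the
index-side versus both-sides decision Shawn mentions below.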

Best,
Erick

On Wed, May 31, 2017 at 12:28 AM, Maciej Ł. PCSS <la...@man.poznan.pl> wrote:
> Shawn, thank you for your response.
>
> In the end, my search is based on two kinds of fields (strings and text, both
> ignoring case and special characters) that can potentially contain any
> language, but mainly Polish or English. This is because the two main
> requirements were:
> 1) Google-like search for quick lookups,
> 2) Precise multi-criteria search.
>
> For the first case we use the "ascii_ignorecase_text" field type (below).
> For the second case we applied "ascii_ignorecase_string". Very often the
> customer knows only part of an identifier / sample name / user's surname /
> address, and still wants to search by that partial information. The
> application is about exploring a scientific database of biological samples,
> each of them having lots of attributes.
>
> Considering the above, I'm fine with the following type definitions:
>
>   <fieldType name="ascii_ignorecase_text" class="solr.TextField"
>              positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.ASCIIFoldingFilterFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>   <fieldType name="ascii_ignorecase_string" class="solr.TextField"
>              positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.KeywordTokenizerFactory"/>
>       <filter class="solr.ASCIIFoldingFilterFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> Thank you for your help!
>
> Regards
> Maciej Łabędzki
>
>
> On 02.02.2017 at 16:55, Shawn Heisey wrote:
>>
>> On 2/2/2017 8:15 AM, Maciej Ł. PCSS wrote:
>>>
>>> regardless of the value of such a use-case, there is another thing
>>> that remains unclear to me.
>>>
>>> Does SOLR support a simple and silly 'exact substring match'? I mean,
>>> is it possible to search for (actually filter by) a raw substring
>>> without tokenization and without any kind of processing/simplifying
>>> the searched information? By a 'raw substring' I mean a character
>>> string that, among others, can contain non-letters (colons, brackets,
>>> etc.) - basically everything the user is able to input via keyboard.
>>>
>>> Is this use case within SOLR's technical capabilities, even if that
>>> means a big efficiency cost?
>>
>> Because you want to do substring matches, things are somewhat more
>> complicated than if you wanted to do a full exact-string-only query.
>>
>> First I'll tackle the full exact query idea, because the info is also
>> important for substrings:
>>
>> If the class in the fieldType is "solr.StrField" then the input will be
>> indexed exactly as it is sent, with all characters preserved, and all
>> of those characters must be present in the query.
>>
>> On the query side, you would need to escape any special characters in
>> the query string -- spaces, colons, and several other characters.
>> Escaping is done with the backslash.  If you are manually constructing
>> URL parameters for an HTTP request, you would also need to be aware of
>> URL encoding.  Some Solr libraries (like SolrJ) are capable of handling
>> all the URL encoding for you.
>>
>> Matching *substrings* with StrField would involve either a regular
>> expression query (with .* before and after) or a wildcard query, which
>> Erick described in his reply.
>>
>> An alternate way to do substring matches is the NGram or EdgeNGram
>> filters, and not using wildcards or regex.  This method will increase
>> your index size, possibly by a large amount.  To use this method, you'd
>> need to switch back to solr.TextField, use the keyword tokenizer, and
>> then follow that with the appropriate NGram filter.  Depending on your
>> exact needs, you might only do the NGram filter on the index side, or
>> you might need it on both index and query analysis.  Escaping special
>> characters on the query side would still be required.
>>
>> The full list of characters that require escaping is at the end of this
>> page:
>>
>>
>> http://lucene.apache.org/core/6_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters
>>
>> Note that it shows && and || as special characters, even though these
>> are in fact two characters each.  Typically even a single instance of
>> these characters requires escaping.  Solr will also need spaces to be
>> escaped.
>>
>> Thanks,
>> Shawn
>
>
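
For the query-side escaping Shawn describes in the quoted message, a
small Python sketch might look like the following. The function name is
hypothetical, and the character set is my reading of the Lucene classic
query parser's special characters plus whitespace, so check it against
the page linked above for your version. Each & and | is escaped
individually, in line with the remark that even a single instance needs
escaping.

```python
import re

# Special characters of the classic query parser, plus whitespace.
_SPECIAL_CHARS = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')


def escape_query_value(value):
    """Prefix every special character in `value` with a backslash."""
    return _SPECIAL_CHARS.sub(r'\\\1', value)


# Example: a raw user input containing a space, a colon and brackets.
escaped = escape_query_value("id: AB(12)")
```

This only covers the query syntax. If the request URL is built by hand,
the escaped value still has to be URL-encoded afterwards.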