Posted to solr-user@lucene.apache.org by Steven White <sw...@gmail.com> on 2020/09/10 15:18:58 UTC

Why use a different analyzer for "index" and "query"?

Hi everyone,

In Solr's schema, I have come across field types that use different logic
for "index" than for "query".  To be clear, I'm talking about this block:

    <fieldType name="text_en" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <!-- what you see in this block doesn't have to be the same as
             what you see inside the "query" block -->
      </analyzer>
      <analyzer type="query">
        <!-- what you see in this block doesn't have to be the same as
             what you see inside the "index" block -->
      </analyzer>
    </fieldType>

Why would one not want to use the same logic for both and simply use:

    <fieldType name="text_en" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <!-- same logic to be used for both "index" and "query" -->
      </analyzer>
    </fieldType>

What are real-world use cases for using a different analyzer for index and
query?

Thanks,

Steve

Re: Why use a different analyzer for "index" and "query"?

Posted by Tim Casey <tc...@gmail.com>.

People usually want to do some analysis at index time.  This analysis
should be considered 'expensive' compared to any single query run.  You
can think of it as indexing all day, over an 86,400-second day, vs. a 200 ms
query time.

Normally, you want to index as honestly as possible.  That is, you want to
take what you are given and put it in the index the way it comes.  You do
this with a particular analyzer.  This produces a token stream, which is
then indexed.  (Solr does much more complicated things now, like two tokens
at the same index position and so on, but a simple model is enough for a
foundational explanation.)

On the query side you can try all kinds of crazy things to find what you
want.  You can build synonyms at this point and query for them all.  You
can stem words and query for the stems, and so on.  You can build distance
queries: two words nearish to each other.

If you produce more tokens at index time, you increase the overall number
of documents returned and, assuming a single set of documents is the desired
search result, this lowers precision.  You will not always be able to find
the thing you want in the fixed set of early query results.  Once those
tokens are in the index, the only way to fix this is at index time, by
reindexing.  It is much easier to make this adjustment at query time:
instead of stemming, make the query more exact, hopefully increasing
precision.

This difference in cost leads to a tendency, over the lifetime of a search
system, towards more complex queries and less complex indexing.

I would recommend avoiding indexing tricks for this reason.  If they are
required, and I am sure they sometimes are, then you may want to segment
the queries in such a way that you can keep over-generation in check
relative to the recall you need.  So, segment the differences by field.
Put time tokens in a time field, so you don't get people named 'june' while
searching for 'jun', for instance.
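
A rough sketch of that kind of field segmentation. The field names
(event_month, person_name) are hypothetical and not from this thread, and
the "string" and "text_general" types are assumed to exist as they do in
Solr's default configset:

```xml
<!-- Hypothetical fields: month tokens and person names are kept apart,
     so a search for a month never runs against the name field. -->
<field name="event_month" type="string"       indexed="true" stored="true"/>
<field name="person_name" type="text_general" indexed="true" stored="true"/>
```

With something like this, a query on event_month:jun only looks at the
month field, and a query on person_name:june only looks at names.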

tim


Re: Why use a different analyzer for "index" and "query"?

Posted by Walter Underwood <wu...@wunderwood.org>.

It is very common for us to do more processing in the index analysis chain. In general, we do that when we want additional terms in the index to be searchable. Some examples:

* synonyms: If the book title is “EMT” add “Emergency Medical Technician”.
* ngrams: For prefix matching, generate all edge ngrams, for example for “french” add “f”, “fr”, “fre”, “fren”, and “frenc”.
* shingles: Make pairs, so the query “babysitter” can match “baby sitter”.
* split on delimiters: break up compounds, so “baby sitter” can match “baby-sitter”. Do this before shingles and you get matches for “babysitter”, “baby-sitter”, and “baby sitter”.
* remove HTML: we rarely see HTML in queries, but we never know when someone will get clever with the source text, sigh.
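
As an illustration of the ngram, shingle, and HTML items above, here is a minimal sketch of two field types that do the extra work only in the index chain. The field type names (text_prefix, text_shingled) and the gram sizes are assumptions made for the example, not a production configuration:

```xml
<!-- Hypothetical field type: edge n-grams are generated at index time only. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- strip HTML markup from the source text before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "french" is indexed as f, fr, fre, fren, frenc, french -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-grams here: the prefix the user typed is matched as-is -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: adjacent words are glued into pairs at index time only. -->
<fieldType name="text_shingled" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- intended result for "baby sitter": baby, babysitter, sitter -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
            outputUnigrams="true" tokenSeparator=""/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI is a good way to check what each chain actually emits for a given input before relying on it.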

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Why use a different analyzer for "index" and "query"?

Posted by Erick Erickson <er...@gmail.com>.

When you want to do something different at index and query time. There, an answer that’s almost, but not quite, completely useless while being accurate ;)

A concrete example is synonyms, as has been mentioned. Say you have an index-time synonym definition of
A,B,C

These three tokens will be “stacked” in the index wherever any of them are found.
A query “q=field:B” would find a document with any of the three tokens in the original. It would be wasteful for the query to be transformed into “q=field:(A B C)”…
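
A minimal sketch of such an index-time-only synonym setup. The field type name is hypothetical, and synonyms.txt is assumed to contain the single line A,B,C:

```xml
<!-- Hypothetical field type; synonyms.txt is assumed to contain: A,B,C -->
<fieldType name="text_syn_index" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stack all three synonyms at the same position wherever one occurs -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- graph filters used at index time are followed by a flatten step -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- no synonym filter: field:B stays a single-term query -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```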

And take a very close look at WordDelimiterGraphFilterFactory. I’m pretty sure you’ll find its index-time and query-time parameters are different. Say the parameters for the input 123-456-7890 cause WDGFF to add
123, 456, 7890, 1234567890 to the index. Again, at query time you don’t need to repeat that and have all of those tokens in the search itself.
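
Roughly, that difference could look like the sketch below. The field type name and the exact parameter choices are assumptions for illustration; the stock schemas shipped with Solr may use different settings:

```xml
<!-- Hypothetical field type: catenation happens only in the index chain. -->
<fieldType name="text_number_parts" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 123-456-7890 is indexed as 123, 456, 7890 plus the catenated 1234567890 -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateNumberParts="1" catenateNumbers="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- only the parts; the catenated form is already in the index -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateNumberParts="1" catenateNumbers="0"/>
  </analyzer>
</fieldType>
```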

Best,
Erick


Re: Why use a different analyzer for "index" and "query"?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

There are a lot of different use cases, and separate analyzers for
indexing and querying are part of Solr's power. For example, you could
apply ngrams at index time to generate multiple substrings. But
you don't want to do that at query time, because otherwise you are
matching on a 'shared prefix' instead of on what the user entered. Think
of a phone number directory where people may enter any suffix and you want
to match it.
See for example
https://www.slideshare.net/arafalov/rapid-solr-schema-development-phone-directory
(slide 16 onwards).
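
A minimal sketch of the phone directory idea. This is not taken from the
slides; the field type name and the gram sizes are assumptions chosen for
the example:

```xml
<!-- Hypothetical field type: substrings are generated at index time only. -->
<fieldType name="phone_substring" class="solr.TextField">
  <analyzer type="index">
    <!-- keep digits only, then treat the whole number as one token -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- index every substring of 3 to 15 digits, suffixes included -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- same normalization, but no n-grams: match exactly what was typed -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With these sizes, a query shorter than three digits would not match
anything, so minGramSize is a trade-off between index size and how short a
fragment people can search for.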

Or, for a non-production but fun use case:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
(search phonetically mapped Thai text in English).

Similarly, you may want to apply synonyms only at query time if you
want to avoid diluting relevancy, or at index time to normalize
spelling and help relevancy.

Or you may want to be doing some accent folding for sorting or
faceting (which uses indexed tokens).

Regards,
   Alex.


Re: Why use a different analyzer for "index" and "query"?

Posted by Stavros Macrakis <ma...@alum.mit.edu>.

I gave an example of why you might want to analyze the corpus differently
from the query just yesterday -- see
https://lucene.472066.n3.nabble.com/Lowercase-ing-everything-but-acronyms-td4462899.html

              -s


Re: Why use a different analyzer for "index" and "query"?

Posted by Thomas Corthals <th...@klascement.net>.

Hi Steve

I have a real-world use case. We don't apply a synonym filter at index
time, but we do apply a managed synonym filter at query time. This allows
content managers to add new synonyms (or remove existing ones) "on the fly"
without having to reindex any documents.
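
A sketch of what such a setup can look like. The field type name and the
managed resource name "english" are placeholders, not the actual
configuration described above:

```xml
<!-- Hypothetical field type: synonyms are applied at query time only. -->
<fieldType name="text_managed_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- no synonym filter, so synonym changes never require reindexing -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- synonym set maintained through Solr's managed resources REST API -->
    <filter class="solr.ManagedSynonymGraphFilterFactory" managed="english"/>
  </analyzer>
</fieldType>
```

Depending on the Solr version, edits to a managed synonym set may only take
effect after the core or collection is reloaded, so check the reference
guide for the version in use.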

Thomas


RE: Why use a different analyzer for "index" and "query"?

Posted by "Dunham-Wilkie, Mike CITZ:EX" <Mi...@gov.bc.ca>.

Hi Steven, 

I can think of one case.  If we have an index of database table or column names, e.g., words like 'THIS_IS_A_TABLE_NAME', we may want to split the name at the underscores when indexing (as well as keep the original), since the individual parts might be significant and meaningful.  When querying, though, if the searcher types in THIS_IS_A_TABLE_NAME then they are likely looking for the whole string, so we wouldn't want to split it apart.
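
A minimal sketch of that case, assuming a hypothetical field type name and whitespace tokenization so that the underscores survive until the word delimiter filter runs:

```xml
<!-- Hypothetical field type: names are split at index time only. -->
<fieldType name="text_identifier" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- THIS_IS_A_TABLE_NAME: keep the original and add THIS, IS, A, TABLE, NAME -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" preserveOriginal="1" splitOnCaseChange="0"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- no splitting: the whole string the searcher typed is one term -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A search for the full name then matches the preserved original token, while a search for a single part such as "table" still matches through the split tokens.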

There also seems to be a debate on whether the synonym filter should be applied at index time, at query time, or both.  Try searching for "solr synonyms index vs query".

Mike
