Posted to users@solr.apache.org by Miguel Joy <Mi...@amexgbt.com.INVALID> on 2022/09/27 09:55:10 UTC

Solr Search - Mixed Case Issue

Hi all,

I'm new to Solr and recently inherited a Solr application (version 5.4) from a previous developer with very little documentation.  At any rate, my problem is this:

I have some email addresses that are stored as mixed case.

Tom.Jones@acme.com = Success [querying for this email address and passing in the full email address in any case [upper or lower] returns the correct result]

Kevin.McNeil@acme.com = Fail [querying for this email address and passing in the full email address in any case [upper or lower] returns zero results]

And here's the fieldType definition that's used for email addresses:

<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.PhoneticFilterFactory" encoder="Caverphone" inject="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"/>
    <filter class="solr.PhoneticFilterFactory" encoder="Caverphone" inject="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I've spent a couple of days researching this issue, and my best guess at a fix would be to re-index this data using the LowerCaseFilterFactory so that all email addresses are stored in lower case, but that would be a significant change as I have over 10 million docs indexed.  In addition, it's strange that we get search results on some mixed-case email addresses, but not all, so I'm hoping that maybe all we need is to tweak the query analyzer?  Thanks in advance for your help with this question.  Please let me know if you need any additional details.

-Miguel



________________________________

Notice: GBT Travel Services UK Limited (GBT UK) and its authorised sublicensees (including Ovation Travel Group and Egencia) use certain trademarks and service marks of American Express Company or its subsidiaries (American Express) in the 'American Express Global Business Travel' and 'American Express Meetings & Events' brands and in connection with its business for permitted uses only under a limited licence from American Express (Licensed Marks). The Licensed Marks are trademarks or service marks of, and the property of, American Express. GBT UK is a subsidiary of Global Business Travel Group, Inc. (NYSE: GBTG). American Express holds a minority interest in GBTG, which operates as a separate company from American Express.

________________________________

This email message and all attachments transmitted with it are solely for the use of the intended recipient(s) and may contain confidential and/or privileged information. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution, copying and/or other use of this message or its attachments is strictly prohibited. If you have received this message in error, please notify the sender and delete it immediately. Unintended transmission shall not constitute a waiver of the attorney-client or any other privilege.

________________________________

Re: Solr Search - Mixed Case Issue

Posted by Walter Underwood <wu...@wunderwood.org>.
I’ve learned these things the hard way from weird behavior in production, mostly due to my own mistakes.

I had to debug some really strange results from my configs at Netflix. It turns out that you don’t want the movie “Saw” to match “see”, for example. :-) And there were several movie titles that completely disappeared after stopword removal. Oops. I wrote that up here:

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

My favorite was “To Be and To Have (Être et Avoir)” which is all-stopwords in two languages. A great movie, too.

The biggest hassle was a movie titled “+/-”, but that is a different problem.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



RE: Solr Search - Mixed Case Issue

Posted by Miguel Joy <Mi...@amexgbt.com.INVALID>.
Hi Walter,

Thanks very much for your honest feedback.  As I mentioned, I inherited this application so I've been trying to pick up the pieces as best I can.  The Solr analysis tool is great, so it's now clear to me how to make changes to the analysis chain and test them using the analysis tool.  I suspect we'll end up having to clean this configuration up and re-index the documents.  Again, thanks to you and Markus for the support.  I have what I need now.

-Miguel

-----Original Message-----
From: Walter Underwood <wu...@wunderwood.org>
Sent: Tuesday, September 27, 2022 2:25 PM
To: users@solr.apache.org
Subject: Re: Solr Search - Mixed Case Issue

Honestly, this analysis chain is a mess.

* StandardTokenizer has parsing support for email addresses, so that is a better choice.
* Never mix phonetic transformation and stemming, use different chains. Phonetic tokens aren’t stemmable.
* Don’t stem email addresses.
* Don’t do phonetic transforms on email addresses unless you really want that.
* Don’t remove stopwords ever, but especially for email addresses.
* Don’t do word delimiter splitting on email addresses unless you really want that.

For stopwords, let’s assume that “in” is in stopwords.txt. That means it corrupts every email address from India.

Instead, use a chain that looks like this. You shouldn’t need separate index and query chains.

* StandardTokenizerFactory
* LowerCaseFilterFactory
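
As a rough sketch, the two-step chain above might look like this in the schema. The field type name text_email is hypothetical, and the attribute values are assumptions rather than anything confirmed in this thread. A single analyzer element with no type attribute applies at both index and query time. One thing worth checking in the analysis tool: in recent Lucene versions StandardTokenizerFactory splits an address at the "@", while UAX29URLEmailTokenizerFactory keeps the whole address as one token, so either may suit depending on whether partial matches are wanted.

```xml
<!-- Sketch only: "text_email" is a hypothetical name, not from the original schema -->
<fieldType name="text_email" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer without a type attribute is used for both indexing and querying -->
  <analyzer>
    <!-- Swap in solr.UAX29URLEmailTokenizerFactory to keep each address as a single token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase every token so Kevin.McNeil and kevin.mcneil index identically -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A change to the index-time chain only affects documents indexed afterwards, so existing documents would still need to be reindexed.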

Using HTMLStripCharFilterFactory for preprocessing probably doesn’t hurt, but shouldn’t be necessary. If someone is using “&gt;” in your content or queries, things are a little weird.

I do like to use Unicode normalization to take care of stuff like curly quotes. That also has more complete lowercasing support. You’ll probably need to include the ICU libraries.

<filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
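
For illustration, here is where that filter could sit in the chain. The field type name text_email_icu is hypothetical. Since nfkc_cf includes case folding, a separate LowerCaseFilterFactory is probably redundant here, but that is an assumption to confirm in the analysis tool. The name attribute follows the line above; later Solr releases may spell this attribute differently, so check the reference guide for the version in use.

```xml
<!-- Sketch only: "text_email_icu" is a hypothetical name; requires the ICU analysis jars on the classpath -->
<fieldType name="text_email_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- NFKC normalization plus case folding, e.g. curly quotes and mixed case -->
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
  </analyzer>
</fieldType>
```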

To test, make an analysis chain like this, then use the analysis tool in the UI to see if it does what you want. If it does that, then you can reindex.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 27, 2022, at 8:06 AM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
>
> Hello Miguel,
>
> That's likely due to catenateAll/catenateWords. McNeil is first split,
> so you can find it using 'mc neil', but not 'mcneil'. With the
> catenate* settings, the parts split from 'McNeil' ('mc' and 'neil')
> can be joined back into 'mcneil'.
>
> If you haven't already, use Solr's analysis GUI [1] for testing these
> configurations. It shows step by step what becomes of the index- and
> query-time analysis chains, and if they match up in the end.
>
> Regards,
> Markus
>
> [1] http://localhost:8983/solr/#/COLLECTION/analysis
>
> On Tue, 27 Sep 2022 at 16:54, Miguel Joy
> <Miguel.Joy@amexgbt.com.invalid> wrote:
>
>> Hi Markus,
>>
>> Thanks so much for your recommendations.  Matching the index-time and
>> query-time splitOnCaseChange attributes partially fixed our issue.
>> Now, if I search for Kevin.McNeil@acme.com and provide the exact same
>> case as the email is stored, I get a successful result!  However, if I
>> search using kevin.mcneil@acme.com (all lower-case), it doesn't match.
>> Essentially, only if I search using the exact same case as the email
>> is stored do I get results.  Any additional ideas on how I can get
>> the email search to fully work?  Thanks again for your help.
>>
>> -Miguel
>>
>>
>>
>> -----Original Message-----
>> From: Miguel Joy
>> Sent: Tuesday, September 27, 2022 6:43 AM
>> To: users@solr.apache.org
>> Subject: RE: Solr Search - Mixed Case Issue
>>
>> Hi Markus,
>>
>> Thanks for your prompt reply to my issue.  I will try your
>> suggestions and report back.
>>
>> Thanks,
>> -Miguel
>>
>> -----Original Message-----
>> From: Markus Jelsma <markus.jelsma@openindex.io>
>> Sent: Tuesday, September 27, 2022 6:36 AM
>> To: users@solr.apache.org
>> Subject: Re: Solr Search - Mixed Case Issue
>>
>> Hello Miguel,
>>
>> The problem lies with the different index-time and query-time
>> WordDelimiterFilter configurations.
>>
>>> In addition, it's strange that we get search results on some mixed
>>> case email addresses
>>
>> Yes, precisely!
>>
>> See the splitOnCaseChange attributes; that is where the problem is.
>> In your case you should be able to copy the index-time configuration
>> to the query-time and get rid of the problem without reindexing. It
>> 'should' solve the problem. If not, try enabling catenateAll on
>> both sides, but that requires a reindex.
>>
>> Ideally you should probably also get rid of the StopFilterFactory;
>> unless it is very well configured (which I do not suspect), it will
>> cause additional weird problems. This does require reindexing.
>>
>> Regards,
>> Markus
>>




Re: Solr Search - Mixed Case Issue

Posted by Walter Underwood <wu...@wunderwood.org>.
Honestly, this analysis chain is a mess.

* StandardTokenizer has parsing support for email addresses, so that is a better choice.
* Never mix phonetic transformation and stemming, use different chains. Phonetic tokens aren’t stemmable.
* Don’t stem email addresses.
* Don’t do phonetic transforms on email addresses unless you really want that.
* Don’t remove stopwords ever, but especially for email addresses.
* Don’t do word delimiter splitting on email addresses unless you really want that.

For stopwords, let’s assume that “in” is in stopwords.txt. That means it corrupts every email address from India.

Instead, use a chain that looks like this. You shouldn’t need separate index and query chains.

* StandardTokenizerFactory
* LowerCaseFilterFactory

Using HTMLStripCharFilterFactory for preprocessing probably doesn’t hurt, but shouldn’t be necessary. If someone is using “&gt;” in your content or queries, things are a little weird.

I do like to use Unicode normalization to take care of stuff like curly quotes. That also has more complete lowercasing support. You’ll probably need to include the ICU libraries.

<filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>

To test, make an analysis chain like this, then use the analysis tool in the UI to see if it does what you want. If it does that, then you can reindex.
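
As a rough sketch of the two-step chain listed above: the fieldType name text_email is hypothetical and not part of the original schema. One assumption to check in the analysis screen: StandardTokenizerFactory may split an address at the "@", whereas solr.UAX29URLEmailTokenizerFactory is the tokenizer meant to keep a whole email address as a single token, so it can be swapped in if that is the behaviour you want.

```xml
<!-- Hypothetical field type name; adjust to your schema. -->
<fieldType name="text_email" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer (no type attribute) is applied at both index and query time. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase every token so that queries match regardless of case. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because existing documents were indexed with the old chain, pointing the email field at a type like this only takes effect for documents indexed afterwards, which is why a reindex is needed.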

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Solr Search - Mixed Case Issue

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Miguel,

That's likely due to catenateAll/catenateWords. McNeil is split first, so
you can find it using 'mc neil', but not 'mcneil'. With the catenate*
settings, the parts 'mc' and 'neil' that 'McNeil' was split into can be
joined back into 'mcneil'.
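
A sketch of what enabling catenation on both sides might look like, reusing the WordDelimiterFilterFactory attributes from the fieldType in the original post. The only attribute changed from the index-time definition is catenateAll="1". Whether this is enough to make the all-lowercase query match is an assumption to verify in the analysis screen, not a confirmed fix.

```xml
<!-- Use the same filter definition in both <analyzer type="index"> and <analyzer type="query">. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="1"
        splitOnCaseChange="1" splitOnNumerics="0"/>
```

Changing the index-time filter only affects documents indexed after the change, so this variant requires a reindex.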

If you haven't already, use Solr's analysis GUI [1] for testing these
configurations. It shows, step by step, what the index- and query-time
analysis chains produce, and whether they match up in the end.
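
The same analysis is also available over HTTP through Solr's field analysis request handler, which can be handy for scripted checks. A sketch, assuming the handler is not already registered in your solrconfig.xml (some Solr versions register it implicitly) and that COLLECTION stands for your collection name:

```xml
<!-- solrconfig.xml: expose field analysis at /analysis/field -->
<requestHandler name="/analysis/field"
                class="solr.FieldAnalysisRequestHandler"
                startup="lazy"/>
```

A request such as http://localhost:8983/solr/COLLECTION/analysis/field?analysis.fieldtype=text_phonetic&analysis.fieldvalue=Kevin.McNeil@acme.com&analysis.query=kevin.mcneil@acme.com should then return the tokens produced at each stage of both chains.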

Regards,
Markus

[1] http://localhost:8983/solr/#/COLLECTION/analysis

Op di 27 sep. 2022 om 16:54 schreef Miguel Joy
<Mi...@amexgbt.com.invalid>:

> Hi Markus,
>
> Thanks so much for your recommendations.  Matching the splitOnCaseChange
> attributes  index-time with the query-time, partially fixed our issue.
> Now, if I search for Kevin.McNeil@acme.com and provide the exact same
> case as the email is stored I get a successful result!  However, if I
> search using kevin.mcneil@acme.com (all lower-case), it doesn't match.
> Essentially, only if I search using the exact same case as the email is
> stored do I get results.  Any additional ideas on how I can get the email
> search to fully work?  Thanks again for your help.
>
> -Miguel
>
>
>
> > ________________________________
> >
> > Notice: GBT Travel Services UK Limited (GBT UK) and its authorised
> > sublicensees (including Ovation Travel Group and Egencia) use certain
> > trademarks and service marks of American Express Company or its
> > subsidiaries (American Express) in the 'American Express Global
> > Business Travel' and 'American Express Meetings & Events' brands and
> > in connection with its business for permitted uses only under a
> > limited licence from American Express (Licensed Marks). The Licensed
> > Marks are trademarks or service marks of, and the property of,
> > American Express. GBT UK is a subsidiary of Global Business Travel
> > Group, Inc. (NYSE: GBTG). American Express holds a minority interest
> > in GBTG, which operates as a separate company from American Express.
> >
> > ________________________________
> >
> > This email message and all attachments transmitted with it are solely
> > for the use of the intended recipient(s) and may contain confidential
> > and/or privileged information. If the reader of this message is not
> > the intended recipient, you are hereby notified that any
> > dissemination, distribution, copying and/or other use of this message
> > or its attachments is strictly prohibited. If you have received this
> > message in error, please notify the sender and delete it immediately.
> > Unintended transmission shall not constitute a waiver of the
> attorney-client or any other privilege.
> >
> > ________________________________
> > Avis : GBT Travel Services UK Limited (GBT UK) et ses d?tenteurs de
> > sous-licence autoris?s (notamment Ovation Travel Group et Egencia)
> > utilise certaines marques commerciales et marques de services
> > d'American Express Company ou de ses filiales (American Express) dans
> > les marques < American Express Global Business Travel > et < American
> > Express Meetings & Events > ainsi qu'en lien avec son activit?, ? des
> > fins autoris?es uniquement, sous une licence limit?e accord?e par
> American Express (marques sous licence).
> > Les marques sous licence sont des marques commerciales ou des marques
> > de services d'American Express, dont elles sont la propri?t?. GBT UK
> > est une filiale de Global Business Travel Group, Inc. (NYSE : GBTG).
> > American Express d?tient une participation minoritaire dans GBTG, qui
> > op?re en tant que soci?t? distincte d'American Express.
> >
> > ________________________________
> >
> > Ce message ?lectronique et toutes les pi?ces jointes transmises avec
> > celui-ci sont uniquement destin?s ? l'usage du ou des destinataires
> > vis?s et peuvent contenir des informations confidentielles et/ou
> > privil?gi?es. Si le lecteur de ce message n'est pas le destinataire
> pr?vu, vous ?tes inform?
> > par la pr?sente que toute diffusion, distribution, copie et/ou autre
> > utilisation de ce message ou de ses pi?ces jointes est strictement
> > interdite. Si vous avez re?u ce message par erreur, veuillez en
> > informer l'exp?diteur et le supprimer imm?diatement. Une transmission
> > involontaire ne constitue pas une renonciation au secret professionnel
> > ou ? toute autre pr?rogative.
> >
> > ________________________________
> >
>
>

RE: Solr Search - Mixed Case Issue

Posted by Miguel Joy <Mi...@amexgbt.com.INVALID>.
Hi Markus,

Thanks so much for your recommendations. Matching the query-time splitOnCaseChange attribute to the index-time value partially fixed our issue. Now, if I search for Kevin.McNeil@acme.com using exactly the same case as the stored email, I get a successful result! However, if I search for kevin.mcneil@acme.com (all lower case), it doesn't match. In short, I only get results when the query uses exactly the same case as the stored email. Any additional ideas on how I can get the email search to work fully? Thanks again for your help.

-Miguel
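
One possible reading of this behaviour: with splitOnCaseChange="1" on both sides, the filter breaks "McNeil" at the lower-to-upper transition but leaves "mcneil" whole, and LowerCaseFilterFactory only runs afterwards, so the two spellings produce different tokens. The fragment below is a hypothetical direction rather than something proposed in the thread: it turns case splitting off in both analyzers so that the split no longer depends on the case of the input. It would change the index-time tokens and therefore assumes a full reindex.

```xml
<!-- Hypothetical sketch: use the same line in BOTH the index and the
     query analyzer of text_phonetic. With splitOnCaseChange="0" the
     filter splits only on delimiters such as "." and "@", so
     "McNeil" and "mcneil" each stay one token before lowercasing. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="0" splitOnNumerics="0"/>
```

All other attribute values are copied from the index-time filter in the original post; only splitOnCaseChange differs. Checking the resulting tokens on the Analysis screen of the Solr admin UI before reindexing would be a sensible precaution.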




RE: Solr Search - Mixed Case Issue

Posted by Miguel Joy <Mi...@amexgbt.com.INVALID>.
Hi Markus,

Thanks for your prompt reply to my issue.  I will try your suggestions and report back.

Thanks,
-Miguel


Re: Solr Search - Mixed Case Issue

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Miguel,

The problem lies with the different index-time and query-time
WordDelimiterFilter configurations.

> In addition, its strange that we get search results on some mixed case
> email addresses

Yes, precisely!

Look at the splitOnCaseChange attributes: that is where the problem is. In your
case you should be able to copy the index-time configuration to the query-time
analyzer and get rid of the problem without reindexing. It 'should' solve the
problem. If not, try enabling catenateAll on both sides, but that requires
reindexing.
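
As a sketch of the first suggestion, the query-time WordDelimiterFilterFactory could reuse the index-time attribute values from the fieldType quoted in the original post. Only this one filter is shown; the rest of the query analyzer is assumed to stay as it is, and whether this alone resolves every case is not confirmed here.

```xml
<!-- Query analyzer of text_phonetic: only this filter is shown.
     Attribute values are copied from the index-time analyzer, so
     splitOnCaseChange, catenateWords and catenateNumbers now agree. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1" splitOnNumerics="0"/>
```

Because only query-side analysis changes, existing documents keep their indexed tokens. Turning on catenateAll, by contrast, changes what is written at index time, which is why it needs a reindex.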

Ideally you should probably also get rid of the StopFilterFactory. Unless it is
very well configured (which I do not suspect), it will cause additional weird
problems. Removing it does require reindexing.

Regards,
Markus

Op di 27 sep. 2022 om 11:55 schreef Miguel Joy
<Mi...@amexgbt.com.invalid>:

> Hi all,
>
> I'm new to Solr and recently inherited a Solr application (version 5.4)
> from a previous developer with very little documentation.  At any rate, my
> problem is this:
>
> I have some email addresses that are stored as mixed case.
>
> Tom.Jones@acme.com<ma...@acme.com> = Success [querying for
> this email address and passing in the full email address in any case [upper
> or lower] returns the correct result]
>
> Kevin.McNeil@acme.com<ma...@acme.com> = Fail [querying for
> this email address and passing in the full email address in any case [upper
> or lower] returns zero results]
>
> And here's the fieldType definition that's used for email addresses:
>
> <fieldType name="text_phonetic" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>       <analyzer type="index">
>         <charFilter class="solr.HTMLStripCharFilterFactory"/>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="stopwords.txt"
>                 />
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
> splitOnNumerics="0"/>
>                 <filter class="solr.PhoneticFilterFactory"
> encoder="Caverphone" inject="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.PorterStemFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                                 <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="stopwords.txt"
>                 />
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
> splitOnNumerics="0"/>
>                                 <filter class="solr.PhoneticFilterFactory"
> encoder="Caverphone" inject="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>                 <filter class="solr.PorterStemFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
> I've spent a couple of days researching this issue, and my best guess at a
> fix would be to re-index this data using the LowerCaseFilterFactory so that
> all email addresses are stored in lower case, but that would be a
> significant change as I have over 10 million docs indexed.  In addition,
> it's strange that we get search results on some mixed case email addresses,
> but not all, so I'm hoping that maybe all we need is to tweak the query
> analyzer?  Thanks in advance for your help with this question.  Please let
> me know if you need any additional details.
>
> -Miguel