Posted to users@solr.apache.org by Christopher Schultz <ch...@christopherschultz.net> on 2022/12/16 17:50:49 UTC

Retconn'ing Solr index schema

All,

I'm trying to determine why a change was made to my internal project 
some years ago. The commit comment is unhelpful, but this field type was 
added and then we changed our "username" field in the Solr index to use 
this field-type:

"add-field-type" : {
   "name":"sortMe",
   "class":"solr.TextField",
   "analyzer":{
     "tokenizer":{
       "class":"solr.KeywordTokenizerFactory"
     },
     "filters":[{
       "class":"solr.LowerCaseFilterFactory"
     }]
   }
}

The "username" field contains (wait for it) the username for a user, 
where each document in the index represents a user. We want to be able 
to search for users given their usernames and also be able to sort based 
upon the value.

I *think* we changed this because of the sorting. If you have a 
username like "foo-bar-baz" then a tokenizing field type will split the 
value into separate terms, but we want to treat the whole thing as one 
continuous string.
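
To illustrate the difference, here is a rough Python sketch. It is only an approximation: the regex split is a crude stand-in for a word-splitting tokenizer, not Solr's actual implementation, and the function names are made up for this example.

```python
import re


def keyword_tokens(value):
    # Like a keyword tokenizer: the whole value becomes a single token.
    return [value]


def word_tokens(value):
    # Crude stand-in for a word-splitting tokenizer: break on any run
    # of non-alphanumeric characters and drop empty pieces.
    return [t for t in re.split(r"[^0-9A-Za-z]+", value) if t]


username = "foo-bar-baz"
print(keyword_tokens(username))  # one token holding the whole username
print(word_tokens(username))     # several tokens, one per word-like piece
```

With a single token per document, there is exactly one value to sort on, which is what we want for usernames.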

We want to do the same thing with email addresses, and we used this same 
field-type for that purpose. For example, it's never useful to search 
for "gmail" in email addresses because some huge percentage of users 
come back. If you really want to search for all gmail users, we want you 
to search for "*gmail*".
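
A small Python sketch of the matching behavior we are after. The assumptions are mine: fnmatchcase stands in for Solr's wildcard matching, index_token is a hypothetical helper that mimics "keep the whole value as one token, then lowercase it", and the addresses are invented.

```python
from fnmatch import fnmatchcase


def index_token(value):
    # Hypothetical helper: the whole value is kept as one token
    # and lowercased, mimicking the field type above.
    return value.lower()


emails = ["Alice@Gmail.com", "bob@example.org"]
tokens = [index_token(e) for e in emails]

# A bare term has to equal the entire token, so "gmail" matches nothing.
exact = [t for t in tokens if fnmatchcase(t, "gmail")]

# An explicit wildcard query matches anywhere inside the token.
wild = [t for t in tokens if fnmatchcase(t, "*gmail*")]

print(exact)
print(wild)
```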

Will we likely achieve our goals with the field-type specified above?

Is there a reason to lowercase everything? Does that affect sorting? It 
does not seem to affect searching.

Thanks,
-chris

Re: Retconn'ing Solr index schema

Posted by Alessandro Benedetti <a....@sease.io>.
Hi,
Apache Solr sorts strings in lexicographic order of the indexed terms, so
uppercase vs. lowercase matters: without the lowercase filter, "Zed" would
sort before "alice".
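
A quick way to see the effect, sketched in plain Python rather than Solr (the usernames are invented for illustration, and Python's code point ordering is used here only as a stand-in for sorting on indexed terms):

```python
usernames = ["bob", "Alice", "carol", "Dave"]

# Plain lexicographic (code point) order: every uppercase ASCII letter
# sorts before every lowercase one.
case_sensitive = sorted(usernames)

# Lowercasing the sort key first, as a lowercase filter would do to the
# indexed term, gives a case-insensitive order.
case_insensitive = sorted(usernames, key=str.lower)

print(case_sensitive)    # ['Alice', 'Dave', 'bob', 'carol']
print(case_insensitive)  # ['Alice', 'bob', 'carol', 'Dave']
```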

Cheers
