Posted to solr-user@lucene.apache.org by Stavros Macrakis <ma...@alum.mit.edu> on 2020/09/09 19:29:45 UTC

Re: Lowercase-ing everything but acronyms

I can't help you on the implementation issues, but...

You may want to do something a little different from keeping all-uppercase
tokens in upper case. You may simply want to special-case all-uppercase
stopwords, so that they are not ignored. The poster boy for that is IT,
which in my last search application was *extremely common* and important.
On the corpus side, [it] and [IT] are very distinct. But on the query side,
most users will write [it], so it's fine to have it in the index as [it]
and not [IT]. Similarly for ON (Ontario) and ME (Maine). A nasty one is OR:
if you are using all-uppercase OR for the Boolean operator, how do users
enter OR meaning Operations Research? We know that not many users will
write ["OR"]. So you may simply want to allow lowercase [or] in the query
to match uppercase [OR] in the corpus, and reserve uppercase OR for the
Boolean operator. Other cases are much rarer (Dijkstra's THE operating
system is of historical interest only...). For non-stopwords, there doesn't
seem to be much of a problem.
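
As a rough illustration of the idea above (plain Python, not Solr
configuration; the stopword set and the function names index_tokens and
parse_query are made up for the example), the index side lowercases
everything but keeps a stopword only when it was written in all caps, and
the query side treats only uppercase OR as the Boolean operator:

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"it", "on", "me", "or", "the"}


def index_tokens(text):
    """Whitespace-split, lowercase, and drop stopwords unless all-uppercase."""
    tokens = []
    for tok in text.split():
        lowered = tok.lower()
        if lowered in STOPWORDS and not tok.isupper():
            # Ordinary stopword such as [it] or [on]: ignore it.
            continue
        # Either a normal word, or an all-uppercase stopword such as [IT]:
        # index it in lowercase so that a query for [it] can match.
        tokens.append(lowered)
    return tokens


def parse_query(query):
    """Treat uppercase OR as the operator; lowercase every other token."""
    parsed = []
    for tok in query.split():
        if tok == "OR":
            parsed.append(("op", "OR"))
        else:
            parsed.append(("term", tok.lower()))
    return parsed
```

With this sketch, a document containing [IT] is indexed with the term [it],
while an ordinary [it] is still dropped, and a user who types lowercase [or]
gets a search term rather than the operator.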

              -s

On Wed, Sep 9, 2020 at 2:59 PM Dunham-Wilkie, Mike CITZ:EX <
Mike.Dunham-Wilkie@gov.bc.ca> wrote:

> Hi SOLR list,
>
> I'm currently using the White Space tokenizer and the Lower Case filter
> with SOLR 7.3.  I'd like to modify the logic to keep any tokens that are
> entirely upper case as upper case, and just apply the Lower Case filter (or
> something equivalent) to the remaining tokens.  Is there a way to do this
> using tokenizers and filters?
>
> Thanks
> Mike
>
>
> Mike Dunham-Wilkie | Senior Spatial Data Administration Analyst | PHONE...
> 778-676-1791
> Data Systems & Services - Digital Platforms and Data Division - Ministry
> of Citizens' Services
>
> For faster response and/or future inquiries, the following email addresses
> are monitored continuously:
> BC Geographic Warehouse (BCGW) and Replication/ETL | DataBC Data
> Architecture Services (databc.da@gov.bc.ca)
> BC Data Catalogue (BCDC) and Open Data | DataBC Catalogue Services (
> datacat@gov.bc.ca)
>
>