Posted to solr-user@lucene.apache.org by "Dunham-Wilkie, Mike CITZ:EX" <Mi...@gov.bc.ca> on 2020/09/09 18:58:56 UTC

Lowercase-ing everything but acronyms

Hi SOLR list,

I'm currently using the White Space tokenizer and the Lower Case filter with SOLR 7.3.  I'd like to modify the logic to keep any tokens that are entirely upper case as upper case, and just apply the Lower Case filter (or something equivalent) to the remaining tokens.  Is there a way to do this using tokenizers and filters?
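To make the requested rule concrete, here is a minimal sketch of the per-token behaviour described above, written in plain Python rather than as a Solr analysis chain. The function name lowercase_except_acronyms is hypothetical, and the whitespace split only stands in for the White Space tokenizer; this illustrates the logic and is not a Solr configuration.

```python
def lowercase_except_acronyms(text):
    """Split on whitespace and lowercase every token that is not entirely upper case."""
    result = []
    for token in text.split():
        # str.isupper() is True only when the token contains at least one cased
        # character and none of them are lowercase, so "IT" and "B.C." are kept
        # as-is while "It" and "it" are lowercased.
        if token.isupper():
            result.append(token)
        else:
            result.append(token.lower())
    return result


print(lowercase_except_acronyms("The IT Department in Ontario"))
```

Note that under this rule a one-letter capitalised word such as "A" or "I" also counts as entirely upper case, so it would be left unchanged.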

Thanks
Mike


Mike Dunham-Wilkie | Senior Spatial Data Administration Analyst | PHONE... 778-676-1791
Data Systems & Services - Digital Platforms and Data Division - Ministry of Citizens' Services

For faster response and/or future inquiries, the following email addresses are monitored continuously:
BC Geographic Warehouse (BCGW) and Replication/ETL | DataBC Data Architecture Services (databc.da@gov.bc.ca<ma...@gov.bc.ca>)
BC Data Catalogue (BCDC) and Open Data | DataBC Catalogue Services (datacat@gov.bc.ca<ma...@gov.bc.ca>)


Re: Lowercase-ing everything but acronyms

Posted by Stavros Macrakis <ma...@alum.mit.edu>.
I can't help you on the implementation issues, but...

You may want to do something a little different from keeping all-uppercase
tokens in upper case. You may want simply to special-case all-uppercase
stopwords, so that they are not ignored. The poster boy for that is IT,
which, in my last search application, was *extremely common* and important.
On the corpus side, [it] and [IT] are very distinct. But on the query side,
most users will write [it], so it's fine to have it in the index as [it]
and not [IT]. Similarly for ON (Ontario) and ME (Maine). A nasty one is OR:
if you are using all-uppercase OR for the Boolean operator, how do users
enter OR meaning Operations Research? We know that not many users will
write ["OR"]. So you may simply want to allow lowercase [or] in the query
to match uppercase [OR] in the corpus, and reserve uppercase OR for the
Boolean operator.  Other cases are much rarer (Dijkstra's THE operating
system is of historical interest only...). For non-stopwords, there doesn't
seem to be much of a problem.
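
A rough sketch of the corpus-side idea above, in plain Python and not as Solr configuration: every token is indexed in lowercase, but a token that was all upper case in the source text is exempt from stopword removal. The stopword list, the function name analyze_corpus_text, and the two-letter minimum for treating a token as an acronym are all assumptions made for illustration; query-side handling is not covered.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"it", "on", "me", "or", "the"}


def analyze_corpus_text(text, stopwords=STOPWORDS):
    """Lowercase all tokens, dropping stopwords unless they were written all upper case."""
    terms = []
    for token in text.split():
        lowered = token.lower()
        # Assumption: an acronym is an all-uppercase token of at least two characters.
        is_acronym = token.isupper() and len(token) > 1
        if lowered in stopwords and not is_acronym:
            continue  # ordinary stopword: leave it out of the index
        terms.append(lowered)  # indexed in lowercase either way
    return terms


print(analyze_corpus_text("Ask IT if it is on in ON"))
```

With this rule, [IT] and [ON] in the corpus end up in the index as [it] and [on], while the ordinary words "it" and "on" are still dropped.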

              -s
