
Posted to user@lucenenet.apache.org by Nitin Shiralkar <ni...@coreobjects.com> on 2009/06/02 13:47:57 UTC

Case-sensitivity problem for un_tokenized fields

Hi All,

We have an "un_tokenized" field in Lucene that contains string values such as "DMSKM_1234" and "rpsla_5678". Searches on this field are not working as expected. If we search for "DMSKM_1234" using the standard analyzer, the required document is never returned. However, if we search for "rpsla_5678", the results are as expected.

I believe the problem is that "un_tokenized" fields are not passed through the analyzer and are indexed without any changes (for example, "DMSKM_1234" is indexed with its case intact). When we search for "DMSKM_1234", however, the query parser converts the term to lowercase "dmskm_1234". Since "un_tokenized" fields are subject to exact match, the expected result is never returned. The system works for values like "rpsla_5678" because all of their letters are already lowercase.
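The mismatch described above can be reproduced in a few lines. This is only a sketch, assuming a Lucene.Net 2.x-era API (parameterless StandardAnalyzer constructor, string-based QueryParser constructor). The field name "dmsid" is borrowed from the example query in question 2, and the class name CaseMismatchDemo is made up for illustration. As far as I understand, for a plain term like this it is the analyzer's lowercase filter that changes the case at parse time.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Hypothetical demo class, not part of the original post.
public static class CaseMismatchDemo
{
    public static void Main()
    {
        // A term that matches what an un_tokenized field stores:
        // the value exactly as it was given at index time.
        Query exact = new TermQuery(new Term("dmsid", "DMSKM_1234"));

        // What the parser builds after running the query text
        // through StandardAnalyzer.
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        Query parsed = parser.Parse("dmsid:DMSKM_1234");

        // Print both so the difference in term case can be compared.
        Console.WriteLine("exact term   : " + exact);
        Console.WriteLine("parsed query : " + parsed);
    }
}
```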

One possible solution I have come across is to call "SetLowercaseExpandedTerms" on the QueryParser object with the value FALSE, so that the keywords are not converted to lowercase. However, this would make the searches case-sensitive.

Questions:

1. Is there a better way to handle the above un_tokenized field so that searches on it are case-insensitive?

2. What would be the impact of setting "SetLowercaseExpandedTerms" to FALSE on tokenized fields? For example, if the query is "title:agreement AND dmsid:DMSKM_1234", where "title" is a tokenized field and "dmsid" is an un_tokenized field, would the "title" field also become case-sensitive?

Limitation:

We are trying to avoid rebuilding the index because of its huge size, and would like to resolve the above problem at search time.

Thanks & regards,

________________________________

Nitin Shiralkar | Engineering Lead


Re: Case-sensitivity problem for un_tokenized fields

Posted by Robert Jordan <ro...@gmx.net>.

Hi,

Nitin Shiralkar wrote:
> Questions:
> 
> 
> 1. Is there a better way to handle above un-tokenized field
> to enable case-insensitive searches?

Try a variation of this QueryParser subclass. The class assumes
that only the field "title" is tokenized/analyzed. For other
fields, a TermQuery will be used:


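     using Lucene.Net.Analysis;
     using Lucene.Net.Index;
     using Lucene.Net.QueryParsers;
     using Lucene.Net.Search;

     // Only "title" goes through the analyzer; every other field is
     // matched as an exact, case-sensitive term.
     public class ExtendedQueryParser : QueryParser
     {
         public ExtendedQueryParser(string field, Analyzer analyzer)
             : base(field, analyzer)
         {
         }

         public override Query GetFieldQuery(string field, string queryText)
         {
             if (field == "title")
                 return base.GetFieldQuery(field, queryText);

             // Bypass the analyzer so the term keeps its original case.
             return new TermQuery(new Term(field, queryText));
         }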

         public override Query GetWildcardQuery(string field, string termStr)
         {
             try
             {
                 if (field != "title")
                     SetLowercaseExpandedTerms(false);

                 return base.GetWildcardQuery(field, termStr);
             }
             finally
             {
                 if (field != "title")
                     SetLowercaseExpandedTerms(true);
             }
         }

         public override Query GetPrefixQuery(string field, string termStr)
         {
             try
             {
                 if (field != "title")
                     SetLowercaseExpandedTerms(false);

                 return base.GetPrefixQuery(field, termStr);
             }
             finally
             {
                 if (field != "title")
                     SetLowercaseExpandedTerms(true);
             }
         }
     }
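
A possible way to exercise the class above, as a rough sketch only. It assumes a Lucene.Net 2.x-era API (Hits, Field.Index.TOKENIZED / UN_TOKENIZED, the three-argument IndexWriter constructor), several of which changed in later releases, and it assumes ExtendedQueryParser is compiled into the same project. The sample document, the in-memory index and the class name ExtendedQueryParserDemo are made up for illustration.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical demo class; relies on ExtendedQueryParser from the post above.
public static class ExtendedQueryParserDemo
{
    public static void Main()
    {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        RAMDirectory directory = new RAMDirectory();

        // Index one made-up document: "title" is analyzed,
        // "dmsid" is stored as a single unmodified term.
        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        Document doc = new Document();
        doc.Add(new Field("title", "Service Agreement",
                          Field.Store.YES, Field.Index.TOKENIZED));
        doc.Add(new Field("dmsid", "DMSKM_1234",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // "title" is the default field; the dmsid clause is turned
        // into a TermQuery by the overridden GetFieldQuery.
        ExtendedQueryParser parser = new ExtendedQueryParser("title", analyzer);
        Query query = parser.Parse("title:agreement AND dmsid:DMSKM_1234");

        IndexSearcher searcher = new IndexSearcher(directory);
        Hits hits = searcher.Search(query);
        Console.WriteLine("{0} -> {1} hit(s)", query, hits.Length());
        searcher.Close();
    }
}
```

Note that the non-title clause is matched exactly, so with this approach "dmsid" searches stay case-sensitive: the query text has to use the same case as the indexed value.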


Robert