Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/06/13 05:17:07 UTC
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

    [ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719069#action_12719069 ] 

Robert Muir commented on LUCENE-1581:
-------------------------------------

For reference, I think the concept of LowerCaseFilter, with or without a Locale, is incorrect for Lucene when the intent is really to erase case differences.

There is an important distinction between converting to lowercase (for presentation), and erasing case differences (for matching and searching).

Here is an example from the Unicode Standard:
Characters may also have different case mappings, depending on the context. For example,
U+03A3 "Σ" greek capital letter sigma lowercases to U+03C3 "σ" greek small letter
sigma if it is followed by another letter, but lowercases to U+03C2 "ς" greek small
letter final sigma if it is not.
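
To make the sigma case concrete, here is a rough Java sketch. The class and method names are made up for illustration. It compares lowercasing one char at a time (which is what a filter working on a term buffer does) with lowercasing the whole string, which lets the JDK apply the context-sensitive final-sigma rule. The two results differ, so a term lowercased one way will not match the same word lowercased the other way.

```java
import java.util.Locale;

// Hypothetical demo class, not part of Lucene.
public class SigmaLowercaseDemo {

    // Lowercases each char in isolation, as a filter that walks a char buffer would.
    // No context is available, so GREEK CAPITAL LETTER SIGMA always maps to U+03C3.
    static String perCharLower(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // Lowercases the whole string, so the final-sigma rule can apply:
    // a capital sigma at the end of a word maps to U+03C2.
    static String wholeStringLower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The Greek word written with the code points U+039F U+0394 U+039F U+03A3.
        String upper = "\u039F\u0394\u039F\u03A3";
        String a = perCharLower(upper);
        String b = wholeStringLower(upper);
        System.out.println(a + " vs " + b + " equal=" + a.equals(b));
    }
}
```

Neither output is wrong as a lowercase form; the point is only that lowercase is a presentation form and does not give a stable key for matching.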

The only correct methods to erase case differences are:
1) Localized (for a specific language): use a collator as recommended here.
2) Multilingual (for a mix of languages): use either the UCA (Unicode Collation Algorithm, i.e. a collator with the ROOT locale) or Unicode case folding, either of which is only an approximation of the language-specific rules involved.
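
As a sketch of the collator approach, assuming the JDK's java.text.Collator as a stand-in (it is not a full UCA implementation; ICU's collator would be closer), something like the following compares terms with case ignored and produces key bytes that are identical for case variants. The class and method names are hypothetical. SECONDARY strength is used because case is a tertiary difference, so accents still count.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper, not an existing Lucene class.
public class CaseInsensitiveKeys {

    // Builds a collator that ignores case (a tertiary difference)
    // but still distinguishes base letters and accents.
    static Collator caseInsensitive(Locale locale) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(Collator.SECONDARY);
        return c;
    }

    // True if the two terms differ at most by case under the given locale's rules.
    static boolean sameIgnoringCase(Locale locale, String a, String b) {
        return caseInsensitive(locale).compare(a, b) == 0;
    }

    // Collation key bytes for a term; case variants of a term yield the same bytes,
    // so the bytes could serve as the indexed form instead of a lowercased string.
    static byte[] key(Locale locale, String term) {
        return caseInsensitive(locale).getCollationKey(term).toByteArray();
    }

    public static void main(String[] args) {
        // Locale.ROOT for mixed-language content; pass a specific Locale for option 1.
        System.out.println(sameIgnoringCase(Locale.ROOT, "Lucene", "LUCENE"));
    }
}
```

For the localized case you would pass the language's Locale instead of Locale.ROOT, and the same collator settings have to be used at both index and query time.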

thanks!


> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1581
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
>         Attachments: TestTurkishCollation.java
>
>
> // Since I am a .NET programmer, the sample code is in C#, but I don't think that will make it hard to follow.
> //
> Assume an input text like "İ" and an analyzer like the one below:
> {code}
>     public class SomeAnalyzer : Analyzer
>     {
>         public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
>         {
>             TokenStream t = new SomeTokenizer(reader);
>             t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>             t = new LowerCaseFilter(t);
>             return t;
>         }
>     }
> {code}
> 	
> ASCIIFoldingFilter will return "I", and LowerCaseFilter will then return
>     "i" (if the locale is "en-US")
>     or
>     "ı" (if the locale is "tr-TR"), which means this token would have to be fed into another instance of ASCIIFoldingFilter.
> So calling LowerCaseFilter before ASCIIFoldingFilter would be one solution, but a better approach would be to add
> a new constructor to LowerCaseFilter that forces it to use a specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         // Culture used for lowercasing; defaults to the current culture.
>         /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>
>         public LowerCaseFilter(TokenStream @in) : base(@in)
>         {
>         }
>
>         /* +++ */ public LowerCaseFilter(TokenStream @in, System.Globalization.CultureInfo CultureInfo) : base(@in)
>         /* +++ */ {
>         /* +++ */     this.CultureInfo = CultureInfo;
>         /* +++ */ }
>
>         public override Token Next(Token result)
>         {
>             result = input.Next(result);
>             if (result != null)
>             {
>                 // Lowercase the term buffer in place, one char at a time.
>                 char[] buffer = result.TermBuffer();
>                 int length = result.TermLength();
>                 for (int i = 0; i < length; i++)
>                     /* +++ */ buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}
> DIGY
