You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/04/17 16:36:27 UTC

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858144#action_12858144 ] 

Robert Muir commented on LUCENE-1343:
-------------------------------------

OK! I think we have a good solution here!.

We can use ICU's Normalizer2 to implement this, by simply creating a custom normalization mapping.
This way we can meet multiple use-cases, e.g. someone wants to remove diacritics, someone else doesn't.

And we get solid unicode behavior and high performance to boot.

So I will keep this issue open, I think the best solution is to take the accent-folding mappings here (or use the ones in AsciiFoldingFilter?) and create a .txt file of mappings, passing it to gennorm2 along with NFKC case fold mappings.

This way we can implement this on top of LUCENE-2399, all compiled to an efficient binary form with no code.
I'll take a shot at this once LUCENE-2399 is resolved.

> A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java
>
>
> The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed.  For example é becomes e.  However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this:  é  )    The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.    Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character  such as  Ł  but which to make searching easier you want to fold onto the latin1  lookalike  version   L  .   
> The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł  -> L )

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org