Posted to dev@lucene.apache.org by "Robert Haschart (JIRA)" <ji...@apache.org> on 2008/07/22 21:29:31 UTC

[jira] Created: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
-----------------------------------------------------------------------------------------------------------------------------

                 Key: LUCENE-1343
                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Robert Haschart
            Priority: Minor


The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed.  For example, é becomes e.  However, another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this: é, written as e followed by U+0301 COMBINING ACUTE ACCENT).  The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode characters at all.  Additionally, there are some instances where a word will contain what looks like an accented character but is actually considered to be a separate unaccented character, such as Ł, which, to make searching easier, you want to fold onto its Latin-1 lookalike, L.

The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters.  It can also handle the cases described above, where characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł -> L ).
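
As an illustration of the composed and decomposed forms described above, here is a minimal sketch. It uses the JDK's java.text.Normalizer rather than the attached UnicodeNormalizationFilter, and the class and method names are hypothetical. It only shows that both spellings of é reduce to the same plain letter once the text is decomposed and the combining marks are dropped.

```java
import java.text.Normalizer;

class ComposedVsDecomposed {
    // Decompose canonically (NFD), then delete every combining mark (category M).
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // single code point: e with acute
        String decomposed = "e\u0301"; // letter e followed by COMBINING ACUTE ACCENT
        System.out.println(stripMarks(composed));
        System.out.println(stripMarks(decomposed));
    }
}
```

A filter that only maps precomposed code points would handle the first string and miss the second, which is the gap the description points at.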

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1343) A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1343:
--------------------------------

    Attachment: LUCENE-1343.patch

Attached is a modified patch (I will upload the new datafile too).
* Applied ICU or Unicode copyright headers to any datafiles where I sourced from their data, and added a mention to NOTICE.txt to that effect.
* Added some additional punctuation mappings to ensure it contains all ASCIIFoldingFilter foldings.

As noted previously, there are 5 places where this disagrees with ASCIIFoldingFilter:
U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE (should be s)
U+2033: DOUBLE PRIME (should be two single quotes)
U+2036: REVERSED DOUBLE PRIME (same as above)
U+2038: CARET (folds to CIRCUMFLEX ACCENT, which should be deleted as it's [:Diacritic:])
U+FF3E: FULLWIDTH CIRCUMFLEX ACCENT (same as above)

I plan to commit in a few days if no one objects.


> A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Robert Haschart
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1343.patch, LUCENE-1343.patch, normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java, utr30.nrm, utr30.nrm

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786712#action_12786712 ] 

Ken Krugler commented on LUCENE-1343:
-------------------------------------

Just to make sure this point doesn't get lost in the discussion over normalization - the issue of "visual normalization" is one that I think ISOLatin1AccentFilter originally was trying to address. Specifically how to fold together forms of letters that a user, when typing, might consider equivalent.

This is indeed language specific, and re-implementing support that's already in ICU4J is clearly a Bad Idea.

I think there's value in a general normalizer that implements the Unicode Consortium's algorithm/data for normalization of int'l domain names, as this is intended to avoid visual spoofing of domain names.

I don't know (haven't tracked) if or when this is going into ICU4J. But, similar to ICU generic sorting, it provides a useful locale-agnostic approach that would work well enough for most Lucene use cases.

> A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622432#action_12622432 ] 

Ken Krugler commented on LUCENE-1343:
-------------------------------------

Hi Robert,

FWIW, the issues being discussed here are very similar to those covered by the [Unicode Security Considerations|http://www.unicode.org/reports/tr36/] technical report #36, and associated data found in the [Unicode Security Mechanisms|http://www.unicode.org/reports/tr39/] technical report #39.

The fundamental issue for int'l domain name spoofing is detecting when two sequences of Unicode code points will render as similar glyphs...which is basically the same issue you're trying to address here, so that when you search for something you'll find all terms that "look" similar.

So for a more complete (though undoubtedly slower & bigger) solution, I'd suggest using ICU4J to do an NFKD normalization, then toss any combining/spacing marks, lower-case the result, and finally apply mappings using the data tables found in the technical report #39 referenced above.

-- Ken
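
The four steps suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the JDK's java.text.Normalizer in place of ICU4J, the class name is hypothetical, and the three-entry LOOKALIKES map is a made-up stand-in for the technical report #39 data tables, not the real data.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;

class VisualFoldSketch {
    // Hypothetical stand-in for the TR39 tables: a few look-alike letters only.
    static final Map<Character, Character> LOOKALIKES = Map.of(
        '\u0142', 'l',  // LATIN SMALL LETTER L WITH STROKE
        '\u00F8', 'o',  // LATIN SMALL LETTER O WITH STROKE
        '\u0111', 'd'); // LATIN SMALL LETTER D WITH STROKE

    static String fold(String input) {
        // 1. Compatibility decomposition (NFKD).
        String s = Normalizer.normalize(input, Normalizer.Form.NFKD);
        // 2. Toss combining marks (spacing and non-spacing).
        s = s.replaceAll("\\p{M}+", "");
        // 3. Lower-case in a locale-independent way.
        s = s.toLowerCase(Locale.ROOT);
        // 4. Apply the look-alike mappings character by character.
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char mapped = LOOKALIKES.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0141\u00F3d\u017A")); // the Polish city name with L-stroke, o-acute, z-acute
    }
}
```

The order matters in this sketch: lower-casing before the table lookup means the table only needs lower-case entries.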


[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622345#action_12622345 ] 

Lance Norskog commented on LUCENE-1343:
---------------------------------------

Some scripts, such as Cyrillic, have a standard Latin-1 transliteration and deserve their own filters.

Cyrillic is one case of this. It is based on three alphabets: 1/3 Latin, 1/3 Greek, and 1/3 new characters for 'ya/ye', 'ts', 'sh', 'ch', 'zh', and 'sh-ch' (fiSH CHips!).

Unit tests are the best way to document the many ways this thing can work.






[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786971#action_12786971 ] 

Robert Muir commented on LUCENE-1343:
-------------------------------------

{quote}
Yes, I'm referring to ancient Greek (grc, not el) and they are tone and breathing marks. Most ancient texts did not have these marks but modern do. Even some modern representations of the ancient. While I have several semesters of koine Greek under my belt and might be wrong, there may be ambiguities where two words have the same letters but differ on marks, but they are infrequent (I don't know of any).
{quote}

I guess I brought this up because this is where you have several situations where case folding and normalization interact, e.g. applying the FC_NFKC set when case folding so that later NFK[CD] normalization will be closed. I know this is supposed to solve various ways the YPOGEGRAMMENI can be implemented, but I forget the details...

This is why I think the general-purpose contribution should be case folding, normalization, and stuff like this (the FC_NFKC set) to make sure they work together...

If you later want to apply something more specialized like StringPrep, you need this logic anyway, see http://www.ietf.org/rfc/rfc3454.txt (especially section 3.2) 



[jira] Resolved: (LUCENE-1343) A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-1343.
---------------------------------

    Resolution: Fixed

Committed revision 936657.


[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615775#action_12615775 ] 

Steven Rowe commented on LUCENE-1343:
-------------------------------------

Hi Robert,

My comments below assume you're interested in having this code hosted in the Lucene source repository - please disregard if that's not the case.

Have you seen the [HowToContribute page on the Lucene wiki|http://wiki.apache.org/lucene-java/HowToContribute]?  It outlines some of the basics concerning code submissions.

A couple of things I noticed that need to be addressed before the code will be accepted:

# Tab characters should be converted to spaces
# Indentation increment should be two spaces
# Test(s) should be moved from the UnicodeNormalizationFilterFactory.main() method into standalone class(es) that extend LuceneTestCase
# More/more explicit javadocs - for example, you should describe the set of provided transformations (e.g. Cyrillic diacritic stripping is included).
# Solr is a separate code base, so the UnicodeNormalizationFilterFactory should be moved to a Solr JIRA issue
# Because it has a dependency on the ICU jar, this contribution will have to live in the contrib/ area -- the Java package name should be adjusted accordingly.
# The submission should be repackaged as a patch (instructions available on the above-linked wiki page).



[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Haschart (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622607#action_12622607 ] 

Robert Haschart commented on LUCENE-1343:
-----------------------------------------

The UnicodeNormalizationFilter does use the decompose normalization 
portion of the icu4j library as a starting point.  However, even with 
that, there are several instances where the normalizer code does not 
decompose a character into an unaccented character and an accent mark, a 
notable one being ( Ł -> L ).  So the UnicodeNormalizationFilter starts 
with the approach you outlined: it performs a decompose normalization, 
then discards all non-spacing modifier characters, and then goes 
on from there to further normalize the data by folding the additional 
characters that aren't handled by the decompose normalization onto their 
Latin-1 lookalikes.

-Robert
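
To make the point about Ł concrete, the following sketch checks whether a character is changed by canonical decomposition at all. It relies on the JDK's java.text.Normalizer instead of icu4j, and the class and method names are hypothetical. Characters for which the check returns false are the ones that would need the extra folding step described above.

```java
import java.text.Normalizer;

class DecompositionCheck {
    // True if NFD changes the string. For a single precomposed character
    // this means it has a canonical decomposition.
    static boolean decomposes(String s) {
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(decomposes("\u00E9")); // e with acute
        System.out.println(decomposes("\u0141")); // L with stroke
        System.out.println(decomposes("\u00D8")); // O with stroke
    }
}
```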







[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786724#action_12786724 ] 

Robert Muir commented on LUCENE-1343:
-------------------------------------

Hi Ken, such functionality does exist, although it is new and I think still changing (you are talking about StringPrep/IDN/etc?).

If a filter for this is desired, we can do it with ICU, though I think it's relatively new (probably not optimized, only works on String, etc.).

I still think even this is stupid, because Unicode encodes characters, not glyphs.


[jira] Commented: (LUCENE-1343) A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858891#action_12858891 ] 

Robert Muir commented on LUCENE-1343:
-------------------------------------

By the way, I have been running this with the ASCIIFoldingFilter tests and ensuring its a superset (e.g. we have at least all their mappings).

But there are some bugs in ASCIIFoldingFilter that should be fixed:

For example, take U+1E9B (LATIN SMALL LETTER LONG S WITH DOT ABOVE).
In Unicode, this is canonically equivalent to U+017F (LONG S) followed by U+0307 (COMBINING DOT ABOVE).
ASCIIFoldingFilter folds U+1E9B (LONG S WITH DOT ABOVE) to an F,
but it folds U+017F (LONG S) to an S.

Unicode defines the long s as a compatibility equivalent of S anyway, but it's worse that ASCIIFoldingFilter is canonically inconsistent with itself.
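
The equivalence described above can be checked with the JDK's own java.text.Normalizer. This is only a minimal sketch: the class and method names (LongSEquivalence, canonicallyEquivalent, compatibilityBase) are made up for illustration and are not part of any attached patch.

```java
import java.text.Normalizer;

public class LongSEquivalence {
    // U+1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE (precomposed form)
    static final String COMPOSED = "\u1E9B";
    // U+017F LATIN SMALL LETTER LONG S followed by U+0307 COMBINING DOT ABOVE
    static final String DECOMPOSED = "\u017F\u0307";

    // Two strings are canonically equivalent if their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    // Apply compatibility decomposition (NFKD), then drop non-spacing marks.
    static String compatibilityBase(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD)
                .replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        System.out.println(canonicallyEquivalent(COMPOSED, DECOMPOSED));
        System.out.println(compatibilityBase(COMPOSED));
        System.out.println(compatibilityBase(DECOMPOSED));
    }
}
```

A folding table that is canonically consistent would give the same result for both inputs, which is what the NFKD-based helper does here.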


> A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Robert Haschart
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1343.patch, normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java, utr30.nrm

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622476#action_12622476 ] 

Erik Hatcher commented on LUCENE-1343:
--------------------------------------

{quote}
Unit tests are the best way to document the many ways this thing can work.
{quote}

gets a judge's score of 11 from me.  Gold for Lance for Quote of the Day.

> A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786941#action_12786941 ] 

DM Smith commented on LUCENE-1343:
----------------------------------

I also am dubious about a general purpose folding filter that maps letters to their ASCII look-alike and agree that folding is language dependent.

Many Americans are illiterate when it comes to text with diacritics and NSMs. Personally I'm nearly illiterate. I think having prominent folding filters without adequate explanation of their pitfalls or usefulness may lead illiterates into a false sense of sufficiency.

If it makes sense to have a filter for TR39, I think that should be a separate issue. If that's what this issue is all about, then its description should be modified.

I think this should otherwise be closed as a bad idea.

Robert Muir, Would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise it would not.
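
To make the Greek idea concrete, here is a rough sketch using only the JDK. It decomposes the text, drops non-spacing marks that follow a Greek base letter, and leaves marks on other scripts alone. The class and method names are hypothetical, and the sketch makes two assumptions that may not suit every use: it also drops the iota subscript (U+0345), and it returns the whole string in NFC.

```java
import java.text.Normalizer;

public class GreekMarkStripper {
    // Removes non-spacing marks attached to Greek base letters only.
    static String stripGreekMarks(String input) {
        // Decompose first so composed and decomposed input behave the same.
        String nfd = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(nfd.length());
        boolean baseIsGreek = false;
        for (int i = 0; i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                if (baseIsGreek) {
                    continue; // drop marks that modify a Greek letter
                }
            } else {
                // Remember the script of the most recent base character.
                baseIsGreek = Character.UnicodeScript.of(cp)
                        == Character.UnicodeScript.GREEK;
            }
            out.appendCodePoint(cp);
        }
        // Recompose so untouched accents (e.g. Latin) come back precomposed.
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFC);
    }
}
```

With this, a polytonic alpha such as U+1F04 comes out as plain alpha, while the accent in "café" is kept.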

Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags?

> A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java

[jira] Updated: (LUCENE-1343) A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1343:
--------------------------------

    Attachment: utr30.nrm

updated datafile.

> A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Robert Haschart
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1343.patch, LUCENE-1343.patch, normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java, utr30.nrm, utr30.nrm

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622746#action_12622746 ] 

Ken Krugler commented on LUCENE-1343:
-------------------------------------

Hi Robert,

So given that you and the Unicode consortium seem to be working on the same problem (normalizing visually similar characters), how similar are your tables to the ones that have been developed to deter spoofing of int'l domain names?

-- Ken

> A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java

[jira] Assigned: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-1343:
-----------------------------------

    Assignee: Robert Muir

> A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java

[jira] Updated: (LUCENE-1343) A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1343:
--------------------------------

              Summary: A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.  (was: A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.)
        Fix Version/s: 3.1
    Affects Version/s: 3.1
        Lucene Fields: [New, Patch Available]  (was: [New])
          Description: 
The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed.  For example é becomes e.  However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this:  é  )    The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.    Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character  such as  Ł  but which to make searching easier you want to fold onto the latin1  lookalike  version   L  .   

The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł  -> L )

  was:
The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed.  For example é becomes e.  However another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing modifier character (like this:  é  )    The ISOLatin1AccentFilter doesn't handle the accents in decomposed unicode characters at all.    Additionally there are some instances where a word will contain what looks like an accented character, that is actually considered to be a separate unaccented character  such as  Ł  but which to make searching easier you want to fold onto the latin1  lookalike  version   L  .   

The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters, it can also handle cases where as described above characters that look like they have diacritics (but don't) are to be folded onto the letter that they look like ( Ł  -> L )


> A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Robert Haschart
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1343.patch, normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java, utr30.nrm

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858144#action_12858144 ] 

Robert Muir commented on LUCENE-1343:
-------------------------------------

OK! I think we have a good solution here!

We can use ICU's Normalizer2 to implement this, by simply creating a custom normalization mapping.
This way we can meet multiple use-cases, e.g. someone wants to remove diacritics, someone else doesn't.

And we get solid unicode behavior and high performance to boot.

So I will keep this issue open. I think the best solution is to take the accent-folding mappings here (or use the ones in ASCIIFoldingFilter?) and create a .txt file of mappings, then pass it to gennorm2 along with the NFKC case fold mappings.

This way we can implement this on top of LUCENE-2399, all compiled to an efficient binary form with no code.
I'll take a shot at this once LUCENE-2399 is resolved.

> A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786968#action_12786968 ] 

DM Smith commented on LUCENE-1343:
----------------------------------

{quote}
 bq.   Robert Muir, Would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise it would not.

The GreekLowerCaseFilter (incorrectly named) does this also, somewhat. It removes tone marks... but this might not be what you "want" (depending on what that is), if you are dealing with polytonic Greek (sorry for my ignorance of the biblical text you are looking at, but I think it is ancient Greek?)
{quote}

Yes, I'm referring to ancient Greek (grc, not el) and they are tone and breathing marks. Most ancient texts did not have these marks but modern do. Even some modern representations of the ancient. While I have several semesters of koine Greek under my belt and might be wrong, there may be ambiguities where two words have the same letters but differ on marks, but they are infrequent (I don't know of any).

The GreekLowerCaseFilter appears to only do some of the work and only works on composed characters.

My question is not whether I'd find the filter useful, but whether it'd be a useful addition to Lucene.

{quote}
bq.   Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags?

This depends on your use case, and then you have dagesh,shin dot, too... These are all NSMs.
{quote}
I have a terrible habit of not being exact or using the proper terms. Shame on me. I meant that the latter strip all other marks.
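
A flag-driven version of the Hebrew idea might look roughly like the following. It is only a sketch: the class and method names are hypothetical, and it assumes "cantillation" means the Hebrew accents at U+0591..U+05AF and "points" means every other non-spacing mark in the Hebrew block (vowel points, dagesh, shin and sin dots, and so on). It does no normalization, so precomposed presentation forms would need to be decomposed beforehand.

```java
public class HebrewMarkStripper {
    // Hebrew accents (cantillation marks) occupy U+0591..U+05AF.
    static boolean isCantillation(int cp) {
        return cp >= 0x0591 && cp <= 0x05AF;
    }

    // Any other non-spacing mark in the Hebrew block:
    // vowel points, dagesh, shin/sin dots, and similar.
    static boolean isPoint(int cp) {
        return Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.HEBREW
                && Character.getType(cp) == Character.NON_SPACING_MARK
                && !isCantillation(cp);
    }

    // One method covers both proposed filters; the flags choose what to drop.
    static String strip(String input, boolean stripCantillation, boolean stripPoints) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            boolean drop = (stripCantillation && isCantillation(cp))
                    || (stripPoints && isPoint(cp));
            if (!drop) {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

Calling strip(text, true, false) removes only cantillation, strip(text, false, true) removes only the other marks, and strip(text, true, true) leaves the bare consonants.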

bq. But this is going to depend on the user, and I think every person will need their own, they can use CharFilter or other ways of defining these tables.

If there is no general purpose contribution, then it should not be part of Lucene and I'll have my own.

When I do work them up, I'll create an issue or two and attach the results. If they are deemed useful then they can be added to Lucene, otherwise ignored.

> A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java

[jira] Updated: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1343:
--------------------------------

    Attachment: utr30.nrm

Attached is the binary file that goes in the resources/ directory.

Although I provide the ant logic to regenerate this, it's kind of a pain because:
* you must download/compile ICU4C (version 4.4); there is no Java gennorm2
* you must run this on a big-endian machine.


> A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-1343.patch, normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java, utr30.nrm

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615770#action_12615770 ] 

Hoss Man commented on LUCENE-1343:
----------------------------------

Random related comment (just because this issue seemed like a good place to put it)

People may also want to consider constructing a Filter based on the substitution tables from the Perl Text::Unidecode module...

http://search.cpan.org/~sburke/Text-Unidecode/
http://interglacial.com/~sburke/tpj/as_html/tpj22.html

...I have no idea how its behavior compares to the UnicodeNormalizationFilter, just that it seems to have similar goals.
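
As a rough illustration of the table-driven approach suggested above, here is a minimal sketch in plain Java. The class name TableFoldingSketch and the handful of table entries are hypothetical, chosen only for illustration; they are not copied from Text::Unidecode, and this is not a Lucene TokenFilter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: fold characters through an explicit substitution table.
final class TableFoldingSketch {
    // Illustrative entries only; a real table would be far larger.
    private static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x0141, "L");  // LATIN CAPITAL LETTER L WITH STROKE
        TABLE.put(0x0142, "l");  // LATIN SMALL LETTER L WITH STROKE
        TABLE.put(0x00DF, "ss"); // LATIN SMALL LETTER SHARP S
        TABLE.put(0x00C6, "AE"); // LATIN CAPITAL LETTER AE
    }

    // Replace each code point that has a table entry; pass everything else through.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            String replacement = TABLE.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

A table like this can cover lookalike characters such as the L with stroke from the issue description, which normalization alone leaves untouched, at the cost of having to maintain the table by hand.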


[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786701#action_12786701 ] 

Robert Muir commented on LUCENE-1343:
-------------------------------------

The big picture here, and in all the other duplicated normalization issues across JIRA, is the outdated Unicode support in the JDK.

This issue speaks of removing diacritical marks / NSMs, but the underlying issue is missing Unicode normalization, which is duplicated here (incorrectly named): LUCENE-1215, and also here: LUCENE-1488 (disclaimer: my impl).

Speaking for the accent removal: in truth I do not think we should simply be removing NSMs, because in most cases they are there for a reason. For example, they are diacritics in a lot of European languages, but for many eastern languages they are the actual vowels (i.e. all the Indic scripts).

We need to separate the issue of missing Unicode normalization (which is clearly something Lucene needs) from the issue of removing diacritics (which is language-specific, and doing it based on Unicode properties is inappropriate).

Finally, just normalizing Unicode in Lucene by itself is not very useful, because there is a careful interaction with other processes, and attention needs to be paid to the order in which filters are run. For example, its interaction with case folding can be a bit tricky. If you are interested in this issue, I urge you to read the javadocs writeup I placed in the ICUNormalizationFilter in LUCENE-1488.
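
To make the point about NSMs acting as vowels concrete, here is a small sketch using only the JDK. The class name IndicMarkDemo is hypothetical, and the stripping shown is the naive property-based approach being argued against, not anything attached to this issue.

```java
import java.text.Normalizer;

// Hypothetical demo: property-based mark removal applied to Devanagari.
final class IndicMarkDemo {
    // Naive accent stripping: decompose, then delete every character in category Mn.
    static String stripMn(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        String ku = "\u0915\u0941"; // DEVANAGARI LETTER KA + VOWEL SIGN U
        String ke = "\u0915\u0947"; // DEVANAGARI LETTER KA + VOWEL SIGN E
        // Both vowel signs are non-spacing marks, so the two distinct
        // syllables are reduced to the same bare consonant letter.
        System.out.println(stripMn(ku).equals(stripMn(ke)));
    }
}
```

Two different syllables become indistinguishable after stripping, which is why removal based purely on the non-spacing-mark property loses real information outside scripts where the marks are optional accents.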



[jira] Updated: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Haschart (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Haschart updated LUCENE-1343:
------------------------------------

    Attachment: UnicodeCharUtil.java
                UnicodeNormalizationFilter.java
                UnicodeNormalizationFilterFactory.java

Source code for UnicodeNormalizationFilter


[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786689#action_12786689 ] 

Mark Miller commented on LUCENE-1343:
-------------------------------------

Mr Muir, can you take a look at this? Does it offer anything over the ASCIIFoldingFilter? If not, we should close; if so, what do you recommend?


[jira] Updated: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1343:
--------------------------------

    Attachment: LUCENE-1343.patch

Attached is a patch that implements UTR#30 as a tailored Unicode normalization form.

Essentially it acts as a combined "Internationalized AsciiFoldingFilter" + NFKC_CaseFold (Unicode Case Folding, Default Ignorable removal, and NFKC normalization).

This is a nice alternative to just using ICUNormalizer2Filter in the case that you want "fuzzy matching" (e.g. ignore diacritical marks). 

The patch is large because it contains all the source data files necessary for gennorm2 to regenerate the 41KB binary trie file... the Java implementation is trivial.
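
For readers without the patch at hand, the kind of combined folding described above can be roughly approximated with JDK classes alone. This is a sketch under loud assumptions: the class name FoldingApproximation is hypothetical, toLowerCase(Locale.ROOT) is only a stand-in for real Unicode case folding, and the result is neither NFKC_CaseFold nor the UTR#30 mapping in the patch.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical approximation of "compatibility normalization + case fold + mark removal".
final class FoldingApproximation {
    static String fold(String input) {
        // Compatibility decomposition: ligatures and full-width forms expand,
        // and precomposed letters split into base letter plus combining mark.
        String s = Normalizer.normalize(input, Normalizer.Form.NFKD);
        // Locale-independent lowercasing (assumption: a rough substitute for case folding).
        s = s.toLowerCase(Locale.ROOT);
        // Remove non-spacing marks; this is the "fuzzy matching" step.
        s = s.replaceAll("\\p{Mn}", "");
        // Recompose whatever is left.
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }
}
```

Under these assumptions, a ligature such as U+FB01 and an accented capital are both reduced to plain lowercase letters, which gives a feel for the behavior, but lookalike characters without decompositions would still need explicit table entries.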



[jira] Updated: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Haschart (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Haschart updated LUCENE-1343:
------------------------------------

    Attachment: normalizer.jar

Java 6 contains a class named java.text.Normalizer that is able to perform Unicode normalization. Earlier versions of Java do not have that class and therefore need the code in this jar (which is a subset of the icu4j library) to perform Unicode normalization. The UnicodeNormalizationFilter can work with either the Java 6 class java.text.Normalizer or the class com.ibm.icu.text.Normalizer in the jar here.
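
For reference, a minimal sketch of what java.text.Normalizer makes possible on Java 6 and later. The class and method names below are hypothetical and this is not the attached UnicodeNormalizationFilter source; it only shows the decompose-then-drop-marks idea.

```java
import java.text.Normalizer;

// Hypothetical sketch: strip accents from both composed and decomposed input.
final class AccentStripSketch {
    static String stripNonSpacingMarks(String input) {
        // NFD splits a composed character into base letter + combining mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            // Keep everything except non-spacing marks (general category Mn).
            if (Character.getType(cp) != Character.NON_SPACING_MARK) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Both the composed and the decomposed form of an accented e come out as a plain e. An L with stroke (U+0141) passes through unchanged, because it has no canonical decomposition; folding it onto L needs a separate mapping.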


[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786946#action_12786946 ] 

Robert Muir commented on LUCENE-1343:
-------------------------------------

bq. Robert Muir, Would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise it would not.

The GreekLowerCaseFilter (incorrectly named) also does this, somewhat. It removes tone marks, but this might not be what you "want" (depending on what that is) if you are dealing with polytonic Greek (sorry for my ignorance of the biblical text you are looking at, but I think it is ancient Greek?).

bq. Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags?

This depends on your use case, and then you have dagesh and shin dot, too... These are all NSMs. But this is going to depend on the user, and I think every person will need their own tables; they can use a CharFilter or other ways of defining these tables.
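
As one possible shape for the "one filter with flags" option asked about above, here is a plain-Java sketch. The class name HebrewMarkStripper and its flag parameters are hypothetical, and the grouping of code points into "cantillation" and "points" is an assumption based on the Unicode Hebrew block layout; a real user may well want different tables.

```java
// Hypothetical sketch: remove Hebrew cantillation marks and/or points by flag.
final class HebrewMarkStripper {
    static String strip(String input, boolean removeCantillation, boolean removePoints) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (removeCantillation && isCantillation(cp)) continue;
            if (removePoints && isPoint(cp)) continue;
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    // Assumption: U+0591..U+05AF are treated as the cantillation marks.
    static boolean isCantillation(int cp) {
        return cp >= 0x0591 && cp <= 0x05AF;
    }

    // Assumption: vowel points plus dagesh, rafe, shin dot, sin dot and qamats qatan.
    static boolean isPoint(int cp) {
        return (cp >= 0x05B0 && cp <= 0x05BD)
            || cp == 0x05BF || cp == 0x05C1 || cp == 0x05C2 || cp == 0x05C7;
    }
}
```

Calling strip(text, true, false) would drop only the cantillation, while strip(text, true, true) would leave bare consonants. Whether dagesh and shin dot belong with the vowel points is exactly the kind of per-user decision described above.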

