Posted to dev@lucene.apache.org by "Simon Willnauer (JIRA)" <ji...@apache.org> on 2009/06/16 16:37:07 UTC

[jira] Created: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Added New Token API impl for ASCIIFoldingFilter
-----------------------------------------------

                 Key: LUCENE-1696
                 URL: https://issues.apache.org/jira/browse/LUCENE-1696
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
    Affects Versions: 2.9
            Reporter: Simon Willnauer
             Fix For: 2.9
         Attachments: ASCIIFoldingFilter._newTokenAPI.patch

I added an implementation of incrementToken to ASCIIFoldingFilter.java and extended the existing test case for it.
I will attach the patch shortly.
Besides this improvement, I would like to start a small discussion about this filter. ASCIIFoldingFilter is meant to be a replacement for ISOLatin1AccentFilter, which is quite nice as it covers a superset of the latter. I have used this filter quite often, but never on an as-is basis. In most cases this filter does the correct thing (replace a special char with its ASCII equivalent), but in some cases, such as German umlauts, it does not return the expected result. A German umlaut like 'ä' should not translate to 'a' but rather to 'ae'. I would like to change this, but I'm not 100% sure that is expected by all users of the filter. Another way of doing it would be to make it configurable with a flag. This would not affect performance, as we would only check whether such an umlaut char is found.
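
A minimal sketch of the flag idea, written in plain Java rather than against the Lucene classes. The class UmlautFoldingSketch, the method fold and the germanDigraphs parameter are hypothetical names used only for illustration, and the generic accent stripping via java.text.Normalizer is a stand-in assumption, not the mapping table ASCIIFoldingFilter actually uses:

```java
import java.text.Normalizer;

/** Hypothetical illustration only; not part of Lucene. */
final class UmlautFoldingSketch {

    /**
     * Folds accented characters to ASCII. When germanDigraphs is true,
     * German umlauts are expanded to two letters (a-umlaut becomes "ae")
     * instead of simply losing the diacritic (a-umlaut becomes "a").
     */
    static String fold(String input, boolean germanDigraphs) {
        StringBuilder out = new StringBuilder(input.length() + 4);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String digraph = germanDigraphs ? germanDigraph(c) : null;
            out.append(digraph != null ? digraph : String.valueOf(c));
        }
        // Stand-in for the generic folding: decompose, then drop combining marks.
        String decomposed = Normalizer.normalize(out, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Returns the two-letter form for a German umlaut, or null for any other char. */
    private static String germanDigraph(char c) {
        switch (c) {
            case '\u00e4': return "ae"; // a-umlaut
            case '\u00f6': return "oe"; // o-umlaut
            case '\u00fc': return "ue"; // u-umlaut
            case '\u00c4': return "Ae"; // A-umlaut
            case '\u00d6': return "Oe"; // O-umlaut
            case '\u00dc': return "Ue"; // U-umlaut
            default: return null;
        }
    }

    public static void main(String[] args) {
        String word = "s\u00fcd"; // "sud" with u-umlaut
        System.out.println(fold(word, false));
        System.out.println(fold(word, true));
    }
}
```

The flag only adds one lookup per character, which matches the expectation above that the extra check should not noticeably affect performance.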
Further, it would be really helpful if the filter could, on demand, "inject" the original/unmodified token into the token stream at the same position as the folded token. I think it's a valid use case to index both the modified and the unmodified token. For instance, the German word "süd" would be folded to "sud". For a query q:(süd) the filter would also fold to "sud" and therefore find "sud", which has a totally different meaning. Folding works quite well, but for special cases we could add those options to make users' lives easier. The latter could be done in a subclass, while the umlaut problem should be fixed in the base class.
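
The injection idea can be sketched as follows, again in plain Java and not against the Lucene classes. In Lucene, a token with a position increment of 0 is stacked on the position of the previous token; the sketch assumes that is the intended way to put both forms at the same position. PreserveOriginalSketch, Tok and foldKeepingOriginal are hypothetical names, and the Normalizer-based fold is a stand-in for the real folding:

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
final class PreserveOriginalSketch {

    /** Minimal stand-in for a token: term text plus position increment. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Stand-in folding: decompose, then drop combining marks. */
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    /**
     * Emits the folded term for every input term. If folding changed the
     * term, the original is emitted as well with increment 0, so that both
     * forms share one position.
     */
    static List<Tok> foldKeepingOriginal(List<String> terms) {
        List<Tok> out = new ArrayList<>();
        for (String term : terms) {
            String folded = fold(term);
            out.add(new Tok(folded, 1));
            if (!folded.equals(term)) {
                out.add(new Tok(term, 0)); // stacked on the folded token
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(foldKeepingOriginal(Arrays.asList("im", "s\u00fcd")));
    }
}
```

With both forms indexed, an exact-match query can target the unfolded term while a folded query still matches either spelling.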

simon 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller reassigned LUCENE-1696:
-----------------------------------

    Assignee: Mark Miller



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720189#action_12720189 ] 

Simon Willnauer commented on LUCENE-1696:
-----------------------------------------

bq. i don't see an alternative, otherwise you will end up with 50-100 sets of language-dependent rules [essentially duplicating the logic collation already knows about]

I agree that this would end up in a mess. Still, collation is not an option, as I cannot rely on the locale in that use case.
I might have to stick with my changes for umlauts at least. :)



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730868#action_12730868 ] 

Uwe Schindler commented on LUCENE-1696:
---------------------------------------

No, it is a new class in 2.9 :-) ASCIIFoldingFilter is not in 2.4.1



[jira] Updated: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1696:
--------------------------------

    Attachment: TestGermanCollation.java

shows how to do this with German... it's a bit more involved since it's non-standard collation behavior, but not too difficult.

you can do this with the JDK version too; i always show the ICU implementation because of its performance. both are available in contrib/collation
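
The attachment itself is not reproduced in this thread. As a rough sketch of the same idea using only the JDK classes (no Lucene, no ICU), the default rules can be extended with a tailoring that makes each umlaut expand to its two-letter form. The rule string follows the expansion syntax described in the java.text.RuleBasedCollator javadoc; the class and method names are hypothetical, and the cast to RuleBasedCollator assumes the default JDK collator provider:

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

/** Hypothetical illustration only; not the attached TestGermanCollation.java. */
final class GermanTailoringSketch {

    /**
     * Builds a collator in which a-umlaut compares like "ae", o-umlaut like
     * "oe" and u-umlaut like "ue" at primary strength.
     */
    static RuleBasedCollator umlautExpandingCollator() {
        // Assumption: the JDK returns a RuleBasedCollator for this locale.
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        // "&ae;X" means: X sorts as "a" followed by "e", with a secondary difference.
        String tailoring = "&ae;\u00e4&AE;\u00c4"
                + "&oe;\u00f6&OE;\u00d6"
                + "&ue;\u00fc&UE;\u00dc";
        try {
            RuleBasedCollator tailored = new RuleBasedCollator(base.getRules() + tailoring);
            // PRIMARY ignores the secondary (accent) and tertiary (case) differences.
            tailored.setStrength(Collator.PRIMARY);
            return tailored;
        } catch (ParseException e) {
            throw new IllegalStateException("tailoring rules did not parse", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator c = umlautExpandingCollator();
        String withUmlaut = "s\u00fcd"; // "sud" with u-umlaut
        System.out.println(c.compare(withUmlaut, "sued"));
        System.out.println(c.compare(withUmlaut, "sud"));
    }
}
```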




[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720192#action_12720192 ] 

Simon Willnauer commented on LUCENE-1696:
-----------------------------------------

Thanks Robert,
I did know about collation before and I validated it for the use case - I do not know what language/locale my docs are in, so I cannot set the correct one. Never mind. :)



[jira] Issue Comment Edited: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730867#action_12730867 ] 

Mark Miller edited comment on LUCENE-1696 at 7/14/09 7:26 AM:
--------------------------------------------------------------

Heh - hate to sound like a broken record, but: making this class final breaks back compat?

      was (Author: markrmiller@gmail.com):
    Heh - hate to sound like a broken record, but: making this class finally breaks back compat?
  


[jira] Closed: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler closed LUCENE-1696.
---------------------------------

    Resolution: Fixed

Resolved with LUCENE-1693. Thanks Simon for the original patch!



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720193#action_12720193 ] 

Robert Muir commented on LUCENE-1696:
-------------------------------------

simon, actually i think it's documented that you can use the ENGLISH collator and it will behave like ASCIIFoldingFilter (simply removing all diacritics).
you could then apply the tailorings like the example and get the behavior you want, versus maintaining a custom ASCIIFoldingFilter...
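
A small sketch of the untailored case with the JDK collator, assuming primary strength is what gives the diacritic-insensitive behavior described here. The class and method names are hypothetical:

```java
import java.text.Collator;
import java.util.Locale;

/** Hypothetical illustration only; plain JDK, no Lucene classes. */
final class EnglishCollatorSketch {

    /** English collator that only looks at base letters. */
    static Collator accentInsensitive() {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength ignores accent and case differences.
        c.setStrength(Collator.PRIMARY);
        return c;
    }

    public static void main(String[] args) {
        Collator c = accentInsensitive();
        String withUmlaut = "s\u00fcd"; // "sud" with u-umlaut
        System.out.println(c.compare(withUmlaut, "sud"));
        // Collation keys are the form that can be stored and compared later.
        System.out.println(
                c.getCollationKey(withUmlaut).compareTo(c.getCollationKey("sud")));
    }
}
```

One difference from a folding filter is worth keeping in mind: at primary strength the collator also treats upper and lower case as equal.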



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730866#action_12730866 ] 

Uwe Schindler commented on LUCENE-1696:
---------------------------------------

I already implemented the new API in this filter for LUCENE-1693. Patch will come shortly together with this issue.

The old API can be removed, the filter is now final and so next() and nextToken() can be left unimplemented.
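
For readers unfamiliar with the new API: a stream is advanced by calling incrementToken(), which returns false at the end of the stream, and a filter changes shared term state in place instead of returning a new Token. A rough plain-Java model of that shape follows. It does not use the Lucene classes; Stream, TermState, ListStream and FoldingFilter are hypothetical stand-ins, and the Normalizer-based folding is an assumption rather than the filter's real mapping:

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical model of an incrementToken()-style pipeline; not Lucene code. */
final class IncrementTokenSketch {

    /** Shared, mutable term state; stands in for a term attribute. */
    static final class TermState {
        String term = "";
    }

    /** Stand-in for a token stream. */
    interface Stream {
        /** Advances to the next token; returns false when exhausted. */
        boolean incrementToken();
    }

    /** Source stream that feeds terms from a list into the shared state. */
    static final class ListStream implements Stream {
        private final Iterator<String> terms;
        private final TermState state;

        ListStream(List<String> terms, TermState state) {
            this.terms = terms.iterator();
            this.state = state;
        }

        @Override
        public boolean incrementToken() {
            if (!terms.hasNext()) {
                return false;
            }
            state.term = terms.next();
            return true;
        }
    }

    /** Final filter that folds the current term in place. */
    static final class FoldingFilter implements Stream {
        private final Stream input;
        private final TermState state;

        FoldingFilter(Stream input, TermState state) {
            this.input = input;
            this.state = state;
        }

        @Override
        public boolean incrementToken() {
            if (!input.incrementToken()) {
                return false;
            }
            state.term = Normalizer.normalize(state.term, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}+", "");
            return true;
        }
    }

    /** Drains a folding pipeline over the given terms and collects the results. */
    static List<String> foldAll(List<String> terms) {
        TermState state = new TermState();
        Stream stream = new FoldingFilter(new ListStream(terms, state), state);
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(state.term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(foldAll(Arrays.asList("caf\u00e9", "sud")));
    }
}
```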



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720197#action_12720197 ] 

Simon Willnauer commented on LUCENE-1696:
-----------------------------------------


bq. simon, actually i think its documented you can use ENGLISH collator and it will behave like asciifolding filter (simply remove all diacritics).
bq. you could then apply the tailorings like the example and get the behavior you want, versus maintaining a custom asciifoldingfilter...

will try, thanks!

> Added New Token API impl for ASCIIFoldingFilter
> -----------------------------------------------
>
>                 Key: LUCENE-1696
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1696
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter._newTokenAPI.patch, TestGermanCollation.java
>
>
> I added an implementation of incrementToken to ASCIIFoldingFilter.java and extended the existing  testcase for it.
> I will attach the patch shortly.
> Beside this improvement I would like to start up a small discussion about this filter. ASCIIFoldingFitler is meant to be a replacement for ISOLatin1AccentFilter which is quite nice as it covers a superset of the latter. I have used this filter quite often but never on a as it is basis. In the most cases this filter does the correct thing (replace a special char with its ascii correspondent) but in some cases like for German umlaut it does not return the expected result. A german umlaut  like 'ä' does not translate to a but rather to 'ae'. I would like to change this but I'n not 100% sure if that is expected by all users of that filter. Another way of doing it would be to make it configurable with a flag. This would not affect performance as we only check if such a umlaut char is found. 
> Further it would be really helpful if that filter could "inject" the original/unmodified token with the same position increment into the token stream on demand. I think its a valid use-case to index the modified and unmodified token. For instance, the german word "süd" would be folded to "sud". In a query q:(süd) the filter would also fold to sud and therefore find sud which has a totally different meaning. Folding works quite well but for special cases would could add those options to make users life easier. The latter could be done in a subclass while the umlaut problem should be fixed in the base class.
> simon 
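
The "inject the original token" idea from the description above can be modelled without any Lucene classes. The following is only a sketch under stated assumptions: `Token`, `fold` and `foldAndKeepOriginal` are hypothetical names, the fold step is a plain accent strip via java.text.Normalizer (not ASCIIFoldingFilter's mapping table), and "same position" is modelled as a position increment of 0 on the injected token.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class InjectOriginalSketch {

    // Hypothetical stand-in for a token: a term plus its position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Simplified folding: decompose to NFD and drop combining marks.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Emit the folded term first; if folding changed the term, also emit
    // the original with increment 0 so both occupy the same position.
    static List<Token> foldAndKeepOriginal(List<String> terms) {
        List<Token> out = new ArrayList<Token>();
        for (String term : terms) {
            String folded = fold(term);
            out.add(new Token(folded, 1));
            if (!folded.equals(term)) {
                out.add(new Token(term, 0));
            }
        }
        return out;
    }
}
```

With both forms indexed at the same position, a query analyzed without folding could still match the unmodified term, while a folded query matches either spelling.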

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720173#action_12720173 ] 

Robert Muir commented on LUCENE-1696:
-------------------------------------

Simon, I think if you want to handle accents in a language-dependent/correct way, you can use contrib/collation for this purpose.

I don't see an alternative; otherwise you will end up with 50-100 sets of language-dependent rules [essentially duplicating the logic collation already knows about]
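
As a rough illustration of what "the logic collation already knows about" means, here is a minimal sketch using only the JDK's java.text.Collator, on which (as far as I know) contrib/collation builds. The class and method names are hypothetical, and this is not the attached TestGermanCollation.java. It compares two strings under German rules at a chosen strength; at PRIMARY strength accent-level differences are ignored, at TERTIARY they are not.

```java
import java.text.Collator;
import java.util.Locale;

public class GermanCollationSketch {

    // Returns true if the two strings compare as equal under the JDK's
    // German collation rules at the given strength
    // (Collator.PRIMARY, SECONDARY or TERTIARY).
    static boolean sameAtStrength(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        // Decompose accented characters so base letter and accent
        // are weighted separately.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String umlaut = "s\u00fcd"; // "süd", written with an escape to avoid encoding issues
        System.out.println(sameAtStrength(umlaut, "sud", Collator.PRIMARY));
        System.out.println(sameAtStrength(umlaut, "sud", Collator.TERTIARY));
    }
}
```

Note that the stock JDK rules treat the umlaut as an accent on the base letter; getting an 'ae'-style expansion would presumably need tailored rules, which is beyond this sketch.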



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730867#action_12730867 ] 

Mark Miller commented on LUCENE-1696:
-------------------------------------

Heh - hate to sound like a broken record, but: making this class final breaks back compat?



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720201#action_12720201 ] 

Robert Muir commented on LUCENE-1696:
-------------------------------------

since this seems to be a recurring theme maybe a javadoc modification would be useful.

otherwise i imagine you might receive lots of bug reports saying 'asciifoldingfilter does X for Y language incorrectly'.

part of the confusion might be because the docs say it 'converts to their ASCII equivalents' and 'equivalent' means different things to different people in different languages...
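
To make the "equivalent means different things" point concrete, here is a small sketch contrasting two possible readings of "ASCII equivalent" for the same input. Both methods are hypothetical illustrations built on java.text.Normalizer; neither is ASCIIFoldingFilter's actual mapping.

```java
import java.text.Normalizer;

public class FoldingSketch {

    // Reading 1: language-neutral folding. Decompose to NFD and drop
    // combining marks, so the umlaut simply loses its diaeresis.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Reading 2: German-style transliteration. Expand umlauts and sharp s
    // first, then fall back to the neutral folding for everything else.
    static String foldGerman(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00e4': sb.append("ae"); break; // ä
                case '\u00f6': sb.append("oe"); break; // ö
                case '\u00fc': sb.append("ue"); break; // ü
                case '\u00c4': sb.append("Ae"); break; // Ä
                case '\u00d6': sb.append("Oe"); break; // Ö
                case '\u00dc': sb.append("Ue"); break; // Ü
                case '\u00df': sb.append("ss"); break; // ß
                default: sb.append(c);
            }
        }
        return stripAccents(sb.toString());
    }
}
```

For the word "süd" the first method gives "sud" while the second gives "sued"; which one counts as correct depends on the language of the text, which is the ambiguity the javadoc would need to spell out.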




[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730870#action_12730870 ] 

Mark Miller commented on LUCENE-1696:
-------------------------------------

Ah, thanks. That's hard to keep track of. It feels like I committed this so long ago that it couldn't possibly be new ;)



[jira] Updated: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-1696:
------------------------------------

    Attachment: ASCIIFoldingFilter._newTokenAPI.patch

all tests pass



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721054#action_12721054 ] 

Mark Miller commented on LUCENE-1696:
-------------------------------------

Patch looks good! I'll just hold off till the token api improvement patch is finished, just in case we need to make an adjustment here.



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721116#action_12721116 ] 

Simon Willnauer commented on LUCENE-1696:
-----------------------------------------

I will be around and fix / adjust it if it needs some changes. If I do not react please send me a ping on this issue. Thanks



[jira] Assigned: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler reassigned LUCENE-1696:
-------------------------------------

    Assignee: Uwe Schindler  (was: Mark Miller)



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720183#action_12720183 ] 

Robert Muir commented on LUCENE-1696:
-------------------------------------

i uploaded a testcase under LUCENE-1581 showing how this works with contrib/collation.
