
Posted to dev@lucene.apache.org by "Andi Vajda (JIRA)" <ji...@apache.org> on 2008/09/17 21:39:44 UTC

[jira] Created: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
------------------------------------------------------------

                 Key: LUCENE-1390
                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
         Environment: any
            Reporter: Andi Vajda


The ISOLatin1AccentFilter removes accents from accented characters in the ISO Latin 1 character set.
It does what it does and there is no bug with it.

It would be nicer, though, if there were a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin-1 Supplement and Latin Extended-A Unicode blocks.
See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block

That way, all languages using roman characters are covered.
A new class, ISOLatinAccentFilter, is attached. It is intended to supersede ISOLatin1AccentFilter, which should be deprecated.
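
To illustrate the idea, here is a minimal sketch of folding characters from both blocks to ASCII. It is not the attached class: the name AccentFoldSketch and the handful of mappings shown are assumptions chosen for illustration, and a real filter would cover every letter in the two blocks.

```java
// Illustrative sketch only: a tiny, hand-picked subset of mappings.
final class AccentFoldSketch {

    /** Folds a few Latin-1 Supplement and Latin Extended-A letters to ASCII. */
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                // Latin-1 Supplement (U+0080-U+00FF)
                case '\u00C0': case '\u00C1': case '\u00C2':
                    out.append('A'); break;   // A with grave, acute, circumflex
                case '\u00E0': case '\u00E1': case '\u00E2':
                    out.append('a'); break;   // a with grave, acute, circumflex
                case '\u00E9': out.append('e'); break;   // e with acute
                case '\u00F3': out.append('o'); break;   // o with acute
                case '\u00DF': out.append("ss"); break;  // sharp s expands to two chars
                // Latin Extended-A (U+0100-U+017F)
                case '\u0100': out.append('A'); break;   // A with macron
                case '\u0101': out.append('a'); break;   // a with macron
                case '\u0141': out.append('L'); break;   // L with stroke
                case '\u0142': out.append('l'); break;   // l with stroke
                case '\u017A': out.append('z'); break;   // z with acute
                default: out.append(c);                  // everything else passes through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Polish city name, written with escapes to avoid encoding issues.
        System.out.println(fold("\u0141\u00F3d\u017A")); // prints "Lodz"
    }
}
```

A Latin-1-only mapping would leave the first and last letters of that word untouched, which is the gap this issue is about.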

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652691#action_12652691 ] 

Robert Muir commented on LUCENE-1390:
-------------------------------------

I am using this patch and it's working well.

Nitpick... I wonder if you could change the mapping of Ə and ə from E to A... This character is only used in Azeri, and not too long ago (<20 years) it was written as A with umlaut, so there is some precedent.

Thanks,
Robert

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java

[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635594#action_12635594 ] 

Steven Rowe commented on LUCENE-1390:
-------------------------------------

bq. It would be good to clarify how the goals of this issue differ from LUCENE-1343 ... as an outside observer they seem to be the same thing

LUCENE-1343 depends on either Java 6 or an ICU jar.  Java 5 support (much less Java 6 support) is still a ways off, and core Lucene can't have external dependencies.  IMO, this issue intends to provide *core* Lucene support for folding a wider set of accented characters to ASCII.
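
For context, a sketch of what a Java 6 based approach could look like, assuming the Java 6 dependency refers to java.text.Normalizer (the thread does not say so explicitly, and this is not taken from the LUCENE-1343 patch). The class and method names are made up for illustration.

```java
import java.text.Normalizer;

// Sketch under the assumption stated above; requires Java 6 or later.
final class NormalizerFoldSketch {

    /** Decomposes to NFD, then drops all combining marks (Unicode category M). */
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // e with acute decomposes into 'e' plus a combining acute accent.
        System.out.println(stripMarks("caf\u00E9")); // prints "cafe"
        // l with stroke (U+0142) has no canonical decomposition, so it survives.
        System.out.println(stripMarks("\u0142").equals("\u0142")); // prints "true"
    }
}
```

Letters such as U+0142 that have no decomposition are one reason an explicit mapping table, as in this issue, can fold more than normalization alone.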

In a moment I'll attach a patch that, in addition to the larger set of accented characters to be folded that Andi has provided, folds all Unicode characters I could find to their ASCII equivalents.

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ISOLatinAccentFilter.java

[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653500#action_12653500 ] 

Robert Muir commented on LUCENE-1390:
-------------------------------------

It's a bit slower, but the difference is minor. I just ran some tests on some CPU-bound indexes that I build (these filters are right at the top of hprof.txt).

I ran them a couple of times and it looks like this... not very scientific, but it gives an idea.

ASCII Folding filter index time (ms): 143365
ISOLatin1Accent filter (ms): 134649
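
Worked out from the two numbers above, the relative overhead is about 6.5%. A small sketch of the arithmetic (the class and method names are made up for illustration):

```java
import java.util.Locale;

final class OverheadSketch {

    /** Relative overhead of the slower run over the faster one, in percent. */
    static double relativeOverheadPercent(long slowerMs, long fasterMs) {
        return 100.0 * (slowerMs - fasterMs) / fasterMs;
    }

    public static void main(String[] args) {
        // Timings quoted above: ASCII folding filter vs. ISOLatin1Accent filter.
        double pct = relativeOverheadPercent(143365L, 134649L);
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pct)); // prints "6.5%"
    }
}
```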


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch

[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631931#action_12631931 ] 

Steven Rowe commented on LUCENE-1390:
-------------------------------------

I think that at a minimum, the Latin Extended Additional block [U+1E00-U+1EFF] should be added, since it is included in Unicode v3.0 (the version of Unicode that Java 1.4.2 is conformant with), and it consists mainly of characters which have exact diacritic-stripped ASCII versions.  Wikipedia doesn't seem to have a page for this block, and I can't find a code chart PDF for Unicode v3.0 on unicode.org, so here's the PDF for Unicode v5.1 (the latest version):

http://www.unicode.org/charts/PDF/U1E00.pdf

Probably should check against the [Unicode 3.0 data file|http://www.unicode.org/Public/3.0-Update1/UnicodeData-3.0.1.txt] to make sure there haven't been any changes to this block between v3.0 and v5.1.0 - i.e., to make sure that the above-linked PDF is accurate for Unicode v3.0.

There is one more Latin block listed in the [Unicode 3.0 blocks data file|http://www.unicode.org/Public/3.0-Update/Blocks-3.txt]: [Latin Extended-B|http://en.wikipedia.org/wiki/Latin_Extended-B_unicode_block].  Since many of the characters in this block don't have exact diacritic-stripped ASCII versions, maybe it could be argued that they shouldn't be included in this filter.  A fair proportion of them (maybe 40%), however, do.
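
As a rough aid for checking which block a given character belongs to, here is a sketch using the JDK's Character.UnicodeBlock. The class and method names are made up for illustration, and the result reflects the Unicode version of the running JVM rather than Unicode 3.0 specifically.

```java
final class LatinBlockSketch {

    /** True if c lies in one of the four Latin blocks discussed in this issue. */
    static boolean inCandidateBlock(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.LATIN_1_SUPPLEMENT
            || block == Character.UnicodeBlock.LATIN_EXTENDED_A
            || block == Character.UnicodeBlock.LATIN_EXTENDED_B
            || block == Character.UnicodeBlock.LATIN_EXTENDED_ADDITIONAL;
    }

    public static void main(String[] args) {
        System.out.println(inCandidateBlock('\u1E00')); // Latin Extended Additional: true
        System.out.println(inCandidateBlock('\u0180')); // Latin Extended-B: true
        System.out.println(inCandidateBlock('a'));      // Basic Latin: false
    }
}
```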


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>         Attachments: ISOLatinAccentFilter.java

[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653049#action_12653049 ] 

Steven Rowe commented on LUCENE-1390:
-------------------------------------

bq. What is the likelihood that a forced upgrade to this class would lose words in an older index without a reindex? 

The problem would be words that contain characters that were not folded by ISOLatin1AccentFilter, but are folded by ASCIIFoldingFilter, and that are used in documents *and* in queries.  Individual implementors would have to make that determination, but it's not outside the realm of possibility.
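
A sketch of the failure mode, using two stand-in folding functions rather than the real filters (narrowFold and broadFold are hypothetical and cover only the characters in the example):

```java
final class RefoldMismatchSketch {

    /** Stand-in for the older, narrower filter: folds only o with acute here. */
    static String narrowFold(String s) {
        return s.replace('\u00F3', 'o');
    }

    /** Stand-in for the broader filter: also folds L/l with stroke and z with acute. */
    static String broadFold(String s) {
        return narrowFold(s)
            .replace('\u0141', 'L')
            .replace('\u0142', 'l')
            .replace('\u017A', 'z');
    }

    public static void main(String[] args) {
        String word = "\u0141\u00F3d\u017A";      // a word with characters from both ranges
        String indexedTerm = narrowFold(word);    // what the old index contains
        String queryTerm = broadFold(word);       // what the upgraded analyzer produces
        System.out.println(queryTerm);                     // prints "Lodz"
        System.out.println(indexedTerm.equals(queryTerm)); // prints "false": the query misses
    }
}
```

The indexed term still carries the unfolded letters, so the same word typed into a query no longer matches until the documents are reindexed with the new filter.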

If ISOLatin1AccentFilter were deprecated for 2.9, and advertised as targeted for removal in 3.0, assuming there will be a significant gap in time between the 2.9 and 3.0 releases, that would give users time to complain about its pending demise, and the plan to remove it could be revisited based on that feedback.

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java

[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652694#action_12652694 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------

Could you please attach a patch for the change you requested? I'm not sure 
it's displaying correctly here. You seem to be asking about a change for the 
mapping of AE and E+acute, which is unexpected. Thanks!




[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632706#action_12632706 ] 

Steven Rowe commented on LUCENE-1390:
-------------------------------------

bq. The Extended-C and D blocks also have relevant things to include

These two blocks were not included in Unicode 3.0, the version supported by Java 1.4.2, which is the Java version that Lucene 2.X supports.

Nevertheless, the ranges these two blocks occupy in Unicode 5.1 are non-characters in Unicode 3.0, so I don't think it would be a problem to add them.

I'll take a look at adding more stuff this weekend.

I also will add the Unicode character descriptions to the comments for each character (e.g. "LATIN CAPITAL LETTER A WITH MACRON").


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653061#action_12653061 ] 

Mark Miller commented on LUCENE-1390:
-------------------------------------

Everything looks pretty good to me. If you can work up one last patch, I'll put it through its paces. I'd like to hear another committer's opinion on deprecating ISOLatin1AccentFilter as well, but I guess we will see if we are able to draw the attention.

I think we have a lot of latitude with the 3.0 move, but I don't know what the consensus is on the limits of that latitude...

- Mark


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652898#action_12652898 ] 

Steven Rowe commented on LUCENE-1390:
-------------------------------------

bq. Steven, I can amend the patch but you said you had more changes coming. If that's the case, could you please add this change as well? If that's not the case, is it OK for me to add this change and call for this bug to be committed to trunk and closed?

Andi, I don't think I said I had more changes coming.  If you are referring to [this comment about pre-tokenization filtering|https://issues.apache.org/jira/browse/LUCENE-1390?focusedCommentId=12643215#action_12643215], I meant that as an idea that could be pursued separately from this issue.

At any rate, please feel free to add Robert's suggested change yourself.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652911#action_12652911 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------

Great, I'll include Robert's change and try to convince a committer to 
finalize it.



[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643152#action_12643152 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------

Wow, Steve, I'm impressed. This is quite an improvement over my earlier patches and even more of an improvement over ISOLatin1AccentFilter. Thank you for doing this!
What's next? Does any Lucene committer watching this bug have objections to checking this in?
One (minor) missing piece in the patch is the deprecation of ISOLatin1AccentFilter itself.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653139#action_12653139 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------




Yep, I'm leaning that way too.



[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653064#action_12653064 ] 

Robert Muir commented on LUCENE-1390:
-------------------------------------

Does ISOLatin1AccentFilter really need to be deprecated? I don't think it's misleading; its docs could just reiterate that it only covers Latin 1 and refer to this one.

This one likewise documents which blocks it covers, and I don't expect it to normalize U+338E to 'mg'.


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632339#action_12632339 ] 

Steven Rowe commented on LUCENE-1390:
-------------------------------------

Andi,

What do you think of the idea of remaking this into a filter that folds all Unicode characters to their ASCII equivalents?  This would be (I think) a superset of what you've done, but would also include things like fullwidth Latin characters, smaller variants of some ASCII symbols, fancy quotation marks, etc.  Maybe non-Arabic decimal numeric characters could be converted to their Arabic versions.
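
As a rough illustration of the folding described above, here is a minimal Java sketch. It is not the code attached to this issue: the class name AsciiFoldingSketch, the method fold, and the handful of mapped characters are assumptions chosen for the example, and a real filter would need a far larger table. It covers three of the cases mentioned: fullwidth Latin characters, fancy quotation marks, and decimal digits from other scripts.

```java
// Illustrative sketch only: a tiny hand-written folding table,
// not the patch attached to this issue.
final class AsciiFoldingSketch {

    /** Folds a few non-ASCII characters to ASCII; unmapped characters pass through. */
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                out.append(c);                       // already ASCII
            } else if (c >= 0xFF01 && c <= 0xFF5E) {
                out.append((char) (c - 0xFEE0));     // fullwidth forms mirror ASCII 0x21..0x7E
            } else if (Character.isDigit(c)) {
                // decimal digits of other scripts, e.g. Arabic-Indic U+0660..U+0669
                out.append((char) ('0' + Character.digit(c, 10)));
            } else {
                switch (c) {
                    case '\u00E9': out.append('e'); break;                  // e with acute
                    case '\u0142': out.append('l'); break;                  // l with stroke
                    case '\u00C6': out.append("AE"); break;                 // AE ligature
                    case '\u2018': case '\u2019': out.append('\''); break;  // curly single quotes
                    case '\u201C': case '\u201D': out.append('"'); break;   // curly double quotes
                    default: out.append(c);                                 // no mapping known
                }
            }
        }
        return out.toString();
    }
}
```

A table-driven switch like this trades completeness for speed: every mapping has to be listed by hand, but each character costs only a comparison and a jump.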

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>         Attachments: ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653132#action_12653132 ] 

Mark Miller commented on LUCENE-1390:
-------------------------------------

Not to be wishy-washy, but deprecating is looking better to me. If there is a large outcry, it can always be revisited before it's fully removed. The odds that you will be legitimately affected appear pretty low. We can call it out in the 3.0 changes, and those who really need it (which is looking like a weird situation to be in) could maintain their own copy. Seem reasonable?

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Assigned: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller reassigned LUCENE-1390:
-----------------------------------

    Assignee: Mark Miller

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Issue Comment Edited: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653031#action_12653031 ] 

markrmiller@gmail.com edited comment on LUCENE-1390 at 12/3/08 3:00 PM:
--------------------------------------------------------------

Regarding deprecating ISOLatin1AccentFilter: what are the back compatibility issues? What is the likelihood that a forced upgrade to this class would lose words in an older index without a reindex?

If it's a real concern, we could keep the deprecated class until 3.0, but that still wouldn't help anyone who wanted to move from 2 to 3 without a reindex (if that's something we will maintain on 3).

So I'm just exploring for comments, but maybe we leave both classes?

      was (Author: markrmiller@gmail.com):
    In regards to deprecating ISOLatin1AccentFilter: what are the back compatibility issues? What is the likelihood that a forced upgrade to this class would lose words in the index without a reindex?

If it's a real concern, we could keep the deprecated class until 3.0, but that still wouldn't help anyone that wanted to move to 3 from 2 without a reindex (if that's something we will maintain on 3).

So I'm just exploring for comments, but maybe we leave both classes?
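
The back-compatibility concern above can be made concrete with a small Java sketch. Both methods are hypothetical stand-ins, not the real filters: foldLatin1Only folds two Latin 1 accents only, and foldBroader adds two Latin Extended A letters. The point is only the mechanism: if an index was built with the narrower folding and queries are later analyzed with the broader one, the query term may no longer equal the indexed term.

```java
// Hypothetical stand-ins for a narrow and a broad folding filter;
// neither reproduces the actual Lucene classes.
final class FoldingMismatchSketch {

    /** Narrow folding: a couple of Latin 1 accented letters, nothing else. */
    static String foldLatin1Only(String term) {
        return term.replace('\u00F3', 'o')    // o with acute
                   .replace('\u00E9', 'e');   // e with acute
    }

    /** Broader folding: the narrow folds plus two Latin Extended A letters. */
    static String foldBroader(String term) {
        return foldLatin1Only(term)
                   .replace('\u0142', 'l')    // l with stroke
                   .replace('\u017A', 'z');   // z with acute
    }

    /** True if a term indexed with the narrow folding still matches a broadly folded query. */
    static boolean stillMatches(String term) {
        return foldLatin1Only(term).equals(foldBroader(term));
    }
}
```

Under these assumptions, a term made only of ASCII and Latin 1 characters still matches after the switch, while a term containing a Latin Extended A letter would need a reindex to be found.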
  
> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631946#action_12631946 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------


Makes sense.


I did look at that block and it looked much more remote from the purpose of 
this class. But you're right, many of these could be handled as well.

And I agree that they should be handled to be able to claim to be doing a 
complete job.

So far, I've claimed that this class handles Latin 1 and Latin Extended A, 
which should cover most, if not all, European/Turkish languages using Latin 
script, and thus goes much farther than the ISOLatin1AccentFilter in that 
respect.
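
To make the claimed coverage checkable, here is a minimal Java sketch that asks the JDK which Unicode block a character belongs to. The class and method names are assumptions made for the example; Character.UnicodeBlock.of and the two block constants are standard JDK API.

```java
// Sketch: test whether a character falls in the two blocks named above.
final class LatinBlockCheck {

    /** True if c is in the Latin-1 Supplement or Latin Extended-A block. */
    static boolean inClaimedCoverage(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.LATIN_1_SUPPLEMENT
            || block == Character.UnicodeBlock.LATIN_EXTENDED_A;
    }
}
```

For instance, U+011F and U+0130 (used in Turkish) fall inside these two blocks, while the fullwidth letter U+FF21 and the schwa U+0259 do not.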


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>         Attachments: ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1390:
--------------------------------

    Attachment: ASCIIFoldingFilter.patch

Minor adjustment to previous version: this version fixes a couple of character mappings that were out of order in the mapped-to-capital-"B" section.

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653539#action_12653539 ] 

Mark Miller commented on LUCENE-1390:
-------------------------------------

Thanks Robert. I plan to commit this in a few days with the deprecation of the latin1 filter for removal in 3.0.

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653272#action_12653272 ] 

Mark Miller commented on LUCENE-1390:
-------------------------------------

So my final thought on this is performance... is handling more characters much slower? Could that be a reason to keep the Latin1 filter as well?

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andi Vajda updated LUCENE-1390:
-------------------------------

    Attachment:     (was: ISOLatinAccentFilter.java)

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652875#action_12652875 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------


Ah, I see now what you're asking for. Sorry about the misunderstanding.
I believe I had picked 'e' for schwa because it looks closest to that 
letter. I have no objections to switching to using 'a' instead if that's 
more "correct".
This Wikipedia article seems to agree: http://en.wikipedia.org/wiki/Schwa_(Cyrillic)
This other Wikipedia article, http://en.wikipedia.org/wiki/Schwa, is less clear 
about this, but it seems that using 'a' instead of 'e' doesn't contradict it.

Steven, I can amend the patch but you said you had more changes coming. If 
that's the case, could you please add this change as well? If that's not the 
case, is it ok for me to add this change and call for this bug to be 
committed to trunk and closed?


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653123#action_12653123 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------

Mark, I attached a new version of the patch with Robert's change.

As for the deprecation of ISOLatin1AccentFilter.java, I don't have a definite opinion on this.
It's pretty much redundant with what this new class does. If the maintenance overhead is not too bad then keeping the duplication around may be worth the effort to preserve some backwards compat.

Thanks for taking this from here!
Andi..

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653010#action_12653010 ] 

Robert Muir commented on LUCENE-1390:
-------------------------------------

Thanks guys. Just as a comment to whoever is listening: I think this is very useful functionality.

I am indexing a lot of docs, and doing it with ICU works well, but that method (Unicode decomposition etc.) is very expensive and still doesn't handle many common cases. In profiling, it was slowing down the entire indexing process.

The existing ISO filter doesn't handle many cases that are actually in use in my text, but this filter works well and appears to cover most of the common cases, such as fullwidth forms. At the same time, it is fast.

Thanks,
Robert
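
For comparison, here is a minimal Java sketch of the decomposition approach mentioned above. It uses the JDK's java.text.Normalizer as a stand-in for ICU, which is an assumption made to keep the example self-contained; the class and method names are likewise illustrative. The string is decomposed (NFD) and the combining marks are then stripped.

```java
import java.text.Normalizer;

// Sketch of accent removal by canonical decomposition, using the JDK
// Normalizer as a stand-in for ICU.
final class DecompositionSketch {

    /** Decomposes s to NFD and removes all combining marks. */
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

This handles letters that decompose into a base letter plus a combining mark, but characters with no canonical decomposition, such as U+00F8 (o with stroke), U+0142 (l with stroke), or fullwidth U+FF21, come back unchanged. That gap is one reason an explicit mapping table can cover cases that decomposition alone misses.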


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652831#action_12652831 ] 

Robert Muir commented on LUCENE-1390:
-------------------------------------

Sean... from your link: "On 16th May 1992 the Latin alphabet for Azerbaijani was slightly revised - the letter ä was replaced with ə and the order of letters was changed as well."

I've never seen 'ae' used in its place, certainly not in the Azeri text that I am indexing.

Andi... I'm referring to the schwa character in Azeri: U+018F (uppercase) and U+0259 (lowercase).

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653016#action_12653016 ] 

Mark Miller commented on LUCENE-1390:
-------------------------------------

Hey guys, not sure how soon I can bring some time to bear on this, but I'd be happy to help get this committed. 

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Sean Timm (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652808#action_12652808 ] 

Sean Timm commented on LUCENE-1390:
-----------------------------------

From my brief reading, it seems that "ae" would be the best transliteration for the schwa character: "Some people write 'æ' instead if the schwa is not available."

http://www.omniglot.com/writing/azeri.htm

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andi Vajda updated LUCENE-1390:
-------------------------------

    Attachment: ISOLatinAccentFilter.java

Now with the U+1E00 - U+1EFF range.

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andi Vajda updated LUCENE-1390:
-------------------------------

    Attachment: ISOLatinAccentFilter.java

The new ISOLatinAccentFilter class, superseding ISOLatin1AccentFilter.

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>         Attachments: ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643215#action_12643215 ] 

Steven Rowe commented on LUCENE-1390:
-------------------------------------

Some of the new mappings, e.g. the character representing parenthesized 10 mapped to the 4-character sequence "(10)", might be better handled before tokenization, very much like SOLR-822. I thought about doing something like that but decided that it was too much work :).  However, maybe it would be possible to write a reader adapter that takes in a *token* filter (like this one) and then passes the entire contents of the reader through the filter.
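
A minimal sketch of the "fold before tokenization" idea, under stated assumptions: FoldingReaderSketch and its one-entry fold() table are hypothetical names invented for illustration, and the class applies a character mapping directly inside a java.io.Reader rather than wrapping an existing token filter as suggested above. It is not code from any patch on this issue.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical illustration: folds characters while the text is being read. */
final class FoldingReaderSketch extends Reader {
    private final Reader in;
    private String pending = ""; // replacement text not yet handed to the caller
    private int pendingPos = 0;

    FoldingReaderSketch(Reader in) {
        this.in = in;
    }

    /** Sample mapping with a single entry (U+247D, PARENTHESIZED NUMBER TEN). */
    private static String fold(char c) {
        return c == '\u247D' ? "(10)" : String.valueOf(c);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int written = 0;
        while (written < len) {
            if (pendingPos < pending.length()) {
                cbuf[off + written++] = pending.charAt(pendingPos++);
                continue;
            }
            int c = in.read(); // one char at a time: simple, not fast
            if (c < 0) {
                break;
            }
            pending = fold((char) c);
            pendingPos = 0;
        }
        return written == 0 ? -1 : written;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Convenience for trying the sketch: reads a whole string through the folder. */
    static String readAll(String text) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new FoldingReaderSketch(new StringReader(text))) {
            for (int ch = r.read(); ch >= 0; ch = r.read()) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

With this sketch a tokenizer reading from the wrapper would see "(10)" as four ordinary characters. One consequence worth noting: because the folded text is longer than the input, character offsets computed downstream no longer line up with the original text unless they are corrected separately.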

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Updated: (LUCENE-1390) add ASCIIFoldingFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1390:
--------------------------------

    Summary: add ASCIIFoldingFilter and deprecate ISOLatin1AccentFilter  (was: add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter)

> add ASCIIFoldingFilter and deprecate ISOLatin1AccentFilter
> ----------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1390:
--------------------------------

    Attachment: ASCIIFoldingFilter.patch

Changes from Andi's version:

# Changed the name of the class to ASCIIFoldingFilter
# Added the Unicode character descriptions to comments on each character
# Added a test class
# Added several other Unicode blocks from which characters are converted to their ASCII equivalents.  Added characters include digits and punctuation.

I did not provide mappings for characters from the Math block - flattening circled plus, for example, didn't seem appropriate.

I *did* provide mappings for IPA and two other phonetic character blocks, and I'm not sure whether this is appropriate.  I was following what seemed to me to be the logic of Andi's mappings, and those provided by ISOLatin1AccentFilter: convert characters to those that *look like* them in ASCII.  As a result, e.g., the character described as "LATIN SMALL LETTER TURNED M" (U+026F) from the IPA block is mapped to "m", regardless of its actual phonetic value.

There are lots of mappings in there now.  I generated the mappings by Perl scripting over the contents of the Unicode 5.1 version of UnicodeData.txt from Unicode.org, after grep'ing e.g. for "LATIN" and "LETTER" or "DIGRAPH", etc., and then moved things around to the appropriate places by hand.  I guess this is one weakness of this patch: it's large enough that manual verification is tough.  It's my hope that adding the Unicode character descriptions will allow for at least improved verifiability.
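
For readers who want to spot-check mappings of this kind, here is a rough sketch of deriving a candidate look-alike from a character's Unicode name. It is an assumption-laden stand-in, not the Perl script described above: it uses the JDK's Character.getName() instead of UnicodeData.txt 5.1, the class name NameBasedFoldingSketch and its regular expression are invented for illustration, and it only understands simple "LATIN ... LETTER X" names.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for eyeballing mappings; covers only simple letter names. */
final class NameBasedFoldingSketch {

    // Matches names such as "LATIN SMALL LETTER E WITH ACUTE" or
    // "LATIN SMALL LETTER TURNED M" and captures the case and the base letter.
    private static final Pattern LATIN_LETTER = Pattern.compile(
        "LATIN (SMALL|CAPITAL) LETTER (?:TURNED |REVERSED )?([A-Z])(?: WITH .+)?");

    /** Returns a candidate ASCII look-alike for codePoint, or null if none is derived. */
    static String candidate(int codePoint) {
        String name = Character.getName(codePoint); // the JDK's copy of the Unicode names
        if (name == null) {
            return null; // unassigned code point
        }
        Matcher m = LATIN_LETTER.matcher(name);
        if (!m.matches()) {
            return null; // digraphs, symbols, digits etc. still need hand mapping
        }
        String base = m.group(2);
        return m.group(1).equals("SMALL") ? base.toLowerCase(Locale.ROOT) : base;
    }
}
```

Calling candidate(0x026F) yields "m" purely from the name LATIN SMALL LETTER TURNED M, which is the same look-alike reasoning described above and equally blind to phonetic value. Names the pattern does not understand, such as CIRCLED PLUS, produce null and would be left for manual review.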

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Resolved: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved LUCENE-1390.
---------------------------------

       Resolution: Fixed
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed, thanks a lot guys!

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andi Vajda updated LUCENE-1390:
-------------------------------

    Attachment:     (was: ISOLatinAccentFilter.java)

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Issue Comment Edited: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635620#action_12635620 ] 

steve_rowe edited comment on LUCENE-1390 at 9/29/08 5:24 PM:
--------------------------------------------------------------

Forgot to mention three additional changes in ASCIIFoldingFilter.patch:

# Some of the added character mappings produce 3 or 4 characters - e.g. the character "⑾" -- described as "PARENTHESIZED NUMBER ELEVEN" -- is mapped to '(' + '1' + '1' + ')'.
# As a result of the increased maximum length of each mapping, the output buffer length is set to 4 times the length of the input token.
# ArrayUtil.getNextSize() is used to resize the output buffer when it needs to grow.
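
The buffer sizing described in the list above can be sketched roughly as follows. This is an illustration only, not the code in ASCIIFoldingFilter.patch: the class name FoldingBufferSketch, the constant MAX_EXPANSION, and the three sample mappings are assumptions chosen to keep the example small, and it allocates a fresh array per call instead of reusing and growing one.

```java
/** Illustrative only; a tiny stand-in for a much larger mapping table. */
final class FoldingBufferSketch {

    /** Longest replacement any single input char can produce, e.g. "(11)". */
    static final int MAX_EXPANSION = 4;

    /** Folds a handful of sample characters; everything else is copied through. */
    static String fold(char[] input, int length) {
        // Worst case: every input char expands to MAX_EXPANSION output chars,
        // so an array of this size can never overflow inside the loop.
        char[] output = new char[MAX_EXPANSION * length];
        int out = 0;
        for (int i = 0; i < length; i++) {
            char c = input[i];
            switch (c) {
                case '\u00E9': // LATIN SMALL LETTER E WITH ACUTE
                    output[out++] = 'e';
                    break;
                case '\u00E6': // LATIN SMALL LETTER AE
                    output[out++] = 'a';
                    output[out++] = 'e';
                    break;
                case '\u247E': // PARENTHESIZED NUMBER ELEVEN
                    output[out++] = '(';
                    output[out++] = '1';
                    output[out++] = '1';
                    output[out++] = ')';
                    break;
                default:
                    output[out++] = c;
            }
        }
        return new String(output, 0, out);
    }
}
```

Sizing the output at four times the input trades some memory for a loop with no bounds checks of its own; a filter that keeps one output array across tokens would only need to reallocate when a longer token arrives.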

  
> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632458#action_12632458 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------



I think that would be a whole lot of typing :)
Not a bad idea, still.
I'm in the process of entering the 1E00 - 1EFF range.
The Extended-C and D blocks also have relevant things to include but I'm hoping to stop at the Extended Additional block currently in progress.


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andi Vajda updated LUCENE-1390:
-------------------------------

    Attachment:     (was: ISOLatinAccentFilter.java)

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>         Attachments: ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653031#action_12653031 ] 

Mark Miller commented on LUCENE-1390:
-------------------------------------

With regard to deprecating ISOLatin1AccentFilter: what are the back compatibility issues? What is the likelihood that a forced upgrade to this class would lose words in the index without a reindex?

If it's a real concern, we could keep the deprecated class until 3.0, but that still wouldn't help anyone that wanted to move to 3 from 2 without a reindex (if that's something we will maintain on 3).

So I'm just exploring for comments, but maybe we leave both classes?
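
To make the reindexing concern concrete, here is a small sketch of how a term folded at index time by a narrower filter can stop matching once queries are folded by a broader one. Both fold methods are hypothetical stand-ins with one or two mappings each; they do not reproduce the actual tables of ISOLatin1AccentFilter or the new filter.

```java
/** Hypothetical stand-ins for a narrow and a broad folding filter. */
final class ReindexSketch {

    /** Narrow folder: knows a-acute (Latin-1) but leaves r-caron untouched. */
    static String foldLatin1Only(String s) {
        return s.replace('\u00E1', 'a');
    }

    /** Broad folder: additionally knows r-caron (Latin Extended-A). */
    static String foldExtended(String s) {
        return foldLatin1Only(s).replace('\u0159', 'r');
    }

    /** An exact-term lookup only hits when both sides folded to the same string. */
    static boolean matches(String indexedTerm, String queryTerm) {
        return indexedTerm.equals(queryTerm);
    }
}
```

For the word "dvořák", the narrow folder stores a term that still contains ř, while the broad folder turns the query into "dvorak", so matches() returns false until the document is reindexed with the broad folder. Words whose accents all fall inside the narrow folder's coverage are unaffected.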

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653520#action_12653520 ] 

Robert Muir commented on LUCENE-1390:
-------------------------------------

Sorry, that wasn't a fair test case. A good chunk of those docs contain accents outside of Latin-1, so ASCIIFoldingFilter was doing more work.

I reran on some heavily accented (but Latin-1 only) data and the difference was negligible, 1% or so.

It appears ASCIIFoldingFilter only slows you down versus ISOLatin1AccentFilter in the case where it probably should: you have accents outside of Latin-1 but are using ISOLatin1AccentFilter.


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated.


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated LUCENE-1390:
-------------------------------------

         Priority: Minor  (was: Major)
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
    Fix Version/s: 2.9

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ISOLatinAccentFilter.java


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andi Vajda updated LUCENE-1390:
-------------------------------

    Attachment: ASCIIFoldingFilter.patch

This latest version supersedes the previous one and maps all schwa characters to 'A' or 'a' depending on their case. U+0259, lowercase schwa, was missing and has been added.
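
For illustration only, here is a minimal sketch of the schwa mapping described above. It is not the code from the attached patch: the class and method names are hypothetical, and I am assuming U+018F is the uppercase counterpart that gets folded to 'A'. The real patch may cover further schwa-like code points.

```java
// Hypothetical sketch, not the attached patch.
final class SchwaFoldingSketch {

    // Fold a single character; anything that is not a schwa passes through unchanged.
    static char foldSchwa(char c) {
        switch (c) {
            case '\u018F': // LATIN CAPITAL LETTER SCHWA (assumed to be covered)
                return 'A';
            case '\u0259': // LATIN SMALL LETTER SCHWA (the one named above)
                return 'a';
            default:
                return c;
        }
    }

    // Fold every character of a term.
    static String fold(String term) {
        StringBuilder sb = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            sb.append(foldSchwa(term.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("m\u0259n")); // prints man
    }
}
```

The mapping is one character to one character, so the output is always the same length as the input.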

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java


[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andi Vajda updated LUCENE-1390:
-------------------------------

    Attachment: ISOLatinAccentFilter.java

ISOLatinAccentFilter.java again, now with Unicode Latin Extended B as well.

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>         Attachments: ISOLatinAccentFilter.java


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653045#action_12653045 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------



This class includes all of ISOLatin1AccentFilter.

Still, the new filter behaves differently: it converts characters that the
old filter left untouched.

If we don't want to impose that kind of backwards-compatibility break on the
3.0 release, then the ISOLatin1AccentFilter class needs to be preserved.
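
To make the compatibility concern concrete, here is a rough sketch. Both folding methods are hypothetical stand-ins with a handful of hand-picked mappings, not the real tables of either filter. The point is only that a term indexed with the narrower folding no longer matches the same text folded with the wider one.

```java
// Hypothetical stand-ins for the old and new filters; the mappings are hand-picked examples.
final class FoldingCompatSketch {

    // "Old" behaviour: only a Latin-1 Supplement character is folded.
    static char foldLatin1Only(char c) {
        return c == '\u00F3' ? 'o' : c; // o with acute
    }

    // "New" behaviour: also folds two Latin Extended-A characters.
    static char foldExtended(char c) {
        switch (c) {
            case '\u0142': // l with stroke
                return 'l';
            case '\u017A': // z with acute
                return 'z';
            default:
                return foldLatin1Only(c);
        }
    }

    static String foldLatin1Only(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) sb.append(foldLatin1Only(s.charAt(i)));
        return sb.toString();
    }

    static String foldExtended(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) sb.append(foldExtended(s.charAt(i)));
        return sb.toString();
    }

    public static void main(String[] args) {
        String term = "\u0142\u00F3d\u017A"; // a word mixing Latin-1 and Extended-A letters
        String indexedTerm = foldLatin1Only(term); // what an existing index would contain
        String queryTerm = foldExtended(term);     // what the new filter would produce
        System.out.println(indexedTerm.equals(queryTerm)); // prints false
    }
}
```

An index built with the old filter would therefore need re-indexing, or the old filter kept at query time, for such terms to keep matching.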

Thanks for volunteering to finalize this bug!


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635620#action_12635620 ] 

Steven Rowe commented on LUCENE-1390:
-------------------------------------

Forgot to mention three additional changes in ASCIIFoldingFilter.patch:

# Some of the added character mappings produce 3 or 4 characters, e.g. the character "⑾" (PARENTHESIZED NUMBER ELEVEN) is mapped to '('+'1'+'1'+')'.
# As a result of the increased maximum length of each mapping, the output buffer length is set to 4 times the length of the input token.
# ArrayUtil.getNextSize() is used to resize the output buffer when it needs to grow.
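
The three points above can be sketched roughly as follows. This is a self-contained illustration, not the patch itself: the class name, the tiny mapping table and the nextSize() helper are all hypothetical, and nextSize() is only a stand-in for Lucene's growth helper, with an over-allocation policy I made up.

```java
// Hypothetical sketch of one-to-many folding with a worst-case-sized, reusable output buffer.
final class MultiCharFoldingSketch {

    // Longest replacement produced by any mapping, per the description above.
    private static final int MAX_EXPANSION = 4;

    // Reused across calls so that most tokens need no allocation.
    private char[] output = new char[16];

    // Stand-in for the real growth helper; the policy here is an assumption.
    private static int nextSize(int needed) {
        return needed + (needed >> 3) + 4;
    }

    String fold(String token) {
        // Worst case: every input character expands to MAX_EXPANSION characters.
        int maxNeeded = MAX_EXPANSION * token.length();
        if (output.length < maxNeeded) {
            output = new char[nextSize(maxNeeded)];
        }
        int outPos = 0;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            switch (c) {
                case '\u247E': // PARENTHESIZED NUMBER ELEVEN, folded to four characters
                    output[outPos++] = '(';
                    output[outPos++] = '1';
                    output[outPos++] = '1';
                    output[outPos++] = ')';
                    break;
                case '\u00E9': // e with acute, folded to one character
                    output[outPos++] = 'e';
                    break;
                default: // everything else is copied through
                    output[outPos++] = c;
                    break;
            }
        }
        return new String(output, 0, outPos);
    }

    public static void main(String[] args) {
        MultiCharFoldingSketch folder = new MultiCharFoldingSketch();
        System.out.println(folder.fold("\u247E"));    // prints (11)
        System.out.println(folder.fold("caf\u00E9")); // prints cafe
    }
}
```

Sizing the buffer for the worst case up front means the inner loop never has to check for overflow, at the cost of some over-allocation for long tokens.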

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653099#action_12653099 ] 

Mark Miller commented on LUCENE-1390:
-------------------------------------

bq. does ISOLatin1AccentFilter really need to be deprecated? I don't think it's misleading, could just reiterate it only covers Latin 1 and ref this one in the docs? 

That's where my thinking is leaning at the moment as well.

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652834#action_12652834 ] 

Robert Muir commented on LUCENE-1390:
-------------------------------------

With regard to transliteration, the BGN/PCGN standard states:

The special letter Ə, ə known as schwa, should be reproduced in that form whenever encountered.
Use Ә (U+04D8) and ә (U+04D9) for schwa when writing in the Cyrillic script, but use Ə (U+018F) and ə (U+0259) for schwa when writing in the Roman alphabet.

In those instances when it cannot be reproduced, however, the letter Ä ä may be substituted for it.

http://earth-info.nga.mil/gns/html/Romanization/Romanization_Azerbaijani.pdf


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java


[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Posted by "Andi Vajda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654160#action_12654160 ] 

Andi Vajda commented on LUCENE-1390:
------------------------------------

Thanks, Mark!


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch
