Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/08/09 17:16:15 UTC

[jira] Created: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

remove custom encoding support in Greek/Russian Analyzers
---------------------------------------------------------

                 Key: LUCENE-1793
                 URL: https://issues.apache.org/jira/browse/LUCENE-1793
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/analyzers
            Reporter: Robert Muir
            Priority: Minor


The Greek and Russian analyzers support custom encodings such as KOI-8; they define things like lowercasing and tokenization for these.

I think that analyzers should support Unicode and that conversion/handling of other charsets belongs somewhere else.

I would like to deprecate/remove the support for these other encodings.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741372#action_12741372 ] 

Earwin Burrfoot commented on LUCENE-1793:
-----------------------------------------

> I am guessing the rationale for the current code is to try to reduce index size? (since these languages are double-byte encoded in Unicode).

The rationale was most probably to support existing non-Unicode systems, databases, files, whatever. My take: anyone still holding onto KOI8, CP1251 and friends should silently do harakiri.



[jira] Updated: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1793:
--------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])
    Fix Version/s: 2.9

Setting to 2.9
I would like to commit in a day or two if there are no objections.



[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741109#action_12741109 ] 

DM Smith commented on LUCENE-1793:
----------------------------------

> If this is the concern, then I think a better solution would be to integrate some form of Unicode compression (e.g. BOCU-1) into Lucene, rather than try to deal with legacy character sets in this way.

So it doesn't get lost, would it be good to open an issue for this? And for alternate encodings?



[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741462#action_12741462 ] 

Robert Muir commented on LUCENE-1793:
-------------------------------------

It seems no one is against this, so I will clean this up and add friendly deprecation warnings to the patch.




[jira] Updated: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1793:
--------------------------------

    Attachment: LUCENE-1793.patch

Patch with more javadoc verbiage.

When do we want to deprecate these strange encodings? It might be too late for 2.9, but I think sooner rather than later would be best.




[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741098#action_12741098 ] 

Uwe Schindler commented on LUCENE-1793:
---------------------------------------

I would also strongly suggest removing these custom charsets. They do not conform to Unicode, because they use char-to-codepoint mappings that simply assign a US-ASCII char to some of the input chars. The problems begin with mixed-language texts.
This strange (and wrong) mapping can also be seen in the tests: they load a KOI-8 file with encoding ISO-8859-1 (to get the native bytes as chars) and then map it. This is very bad!
The analyzers should really work only on Unicode codepoints and nothing more. For backwards compatibility with old indexes (that are encoded using this strange mapping), we have to preserve the charsets for a while, but we should deprecate all of them and leave only UTF-16 (Java chars) as input.

You are right: to reduce index size, it would be good to also support other encodings in addition to UTF-8 for storage of term text.



[jira] Updated: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1793:
--------------------------------

    Attachment: LUCENE-1793.patch

Updated patch with "removed in next release" changed to "removed in Lucene 3.0".

The CHANGES entry reads:

Deprecate the custom encoding support in the Greek and Russian
Analyzers. If you need to index text in these encodings, please use Java's
character set conversion facilities (InputStreamReader, etc) during I/O,
so that Lucene can analyze this text as Unicode instead.
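
The conversion described in that entry could look roughly like the sketch below. It is only an illustration, not code from the patch: the class and method names (LegacyCharsetReader, open, readAll) are made up here, KOI8-R is picked as an example charset, and whether a given charset name is available depends on the Java runtime in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper (not part of Lucene): decodes legacy bytes to Unicode during I/O.
final class LegacyCharsetReader {

    // Wraps a byte stream so that it is decoded from the named charset into Java chars.
    static Reader open(InputStream in, String charsetName) {
        return new InputStreamReader(in, Charset.forName(charsetName));
    }

    // Drains a Reader into a String; only used here to inspect the decoded text.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three Cyrillic letters, written as escapes so the source file encoding does not matter.
        String text = "\u0416\u0443\u043a";
        // Simulate a legacy file: the same text stored as KOI8-R bytes.
        byte[] koi8 = text.getBytes(Charset.forName("KOI8-R"));
        Reader unicode = open(new ByteArrayInputStream(koi8), "KOI8-R");
        System.out.println(readAll(unicode).equals(text)); // prints "true"
    }
}
```

The Reader returned by open carries ordinary Unicode chars, so it can be handed to an analyzer like any other text source, and the analyzer itself never needs to know about the legacy charset.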



[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741173#action_12741173 ] 

DM Smith commented on LUCENE-1793:
----------------------------------

I wasn't thinking about any encoding in particular. It was in reference to Uwe's comment:
> You are right: to reduce index size, it would be good to also support other encodings in addition to UTF-8 for storage of term text.

I think that having other encodings can be problematic, in ways that Unicode solves. But if the idea is worthy of discussion, then a new issue would be a better place to house it.

Regarding BOCU, it is a patented algorithm requiring a license from IBM for implementation. I gather that it is part of ICU. Not sure if either is a big deal or not.



[jira] Updated: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1793:
--------------------------------

    Attachment: LUCENE-1793.patch

Patch that deprecates the custom charsets.

After a release, a lot of code can be completely removed and these analyzers will be much simpler.

I can add more verbiage to these if needed, but I doubt that this stuff is actually working correctly...




[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745456#action_12745456 ] 

Mark Miller commented on LUCENE-1793:
-------------------------------------

We have not started code freeze yet - I'd deprecate if it makes sense.



[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741111#action_12741111 ] 

Robert Muir commented on LUCENE-1793:
-------------------------------------

Good idea. I'm curious what other encodings you had in mind. I only thought of BOCU because it maintains binary sort order, so it makes sense for an index...

It also looks like there could be a performance benefit to something like this over UTF-8 (at least for certain languages): http://unicode.org/notes/tn6/#Performance




[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741093#action_12741093 ] 

Robert Muir commented on LUCENE-1793:
-------------------------------------

I am guessing the rationale for the current code is to try to reduce index size? (since these languages are double-byte encoded in Unicode).

If this is the concern, then I think a better solution would be to integrate some form of Unicode compression (e.g. BOCU-1) into Lucene, rather than try to deal with legacy character sets in this way.




[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741241#action_12741241 ] 

Robert Muir commented on LUCENE-1793:
-------------------------------------

DM, you are right: this is a better discussion for another issue/place.
I was concerned that we would be taking functionality away, but this is not the case; as Uwe says, it is only "strange".

I just looked at all these encodings: they all store characters in the extended ASCII range (> 0x7F).
Therefore, anyone using this strange encoding support is using 2 bytes per character already!
For example, someone using CP1251 in the Russian analyzer is simply storing Ж as 0xC6, so it's being represented as Æ (2 bytes in UTF-8).
So, by deprecating these encodings in favor of Unicode, nobody's index size will double...
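
The byte counts in that argument can be checked with a few lines of Java. This is only a sketch of the arithmetic, not code from the analyzers; the class and method names (LegacyByteCost, utf8Length, bytesAsLatin1Chars) are invented for the example, and it assumes the runtime provides the windows-1251 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares UTF-8 sizes of a real Cyrillic letter and its
// "legacy bytes read as chars" stand-in.
final class LegacyByteCost {

    // Number of bytes UTF-8 needs to store the given text.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Maps each raw byte to the char with the same value, as ISO-8859-1 decoding does.
    static String bytesAsLatin1Chars(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String zhe = "\u0416"; // Cyrillic capital letter ZHE
        byte[] cp1251 = zhe.getBytes(Charset.forName("windows-1251"));
        String pseudo = bytesAsLatin1Chars(cp1251); // U+00C6, which displays as the AE ligature

        System.out.println(Integer.toHexString(cp1251[0] & 0xFF)); // prints "c6"
        System.out.println(utf8Length(pseudo)); // prints "2"
        System.out.println(utf8Length(zhe));    // prints "2"
    }
}
```

Both the stand-in char and the real letter fall in the range U+0080..U+07FF, which UTF-8 encodes in two bytes, so switching to real Unicode text costs nothing extra per character.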




[jira] Assigned: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-1793:
-----------------------------------

    Assignee: Robert Muir



[jira] Resolved: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-1793.
---------------------------------

    Resolution: Fixed

Committed revision 806886.
