Posted to dev@lucene.apache.org by "Cédrik LIME (JIRA)" <ji...@apache.org> on 2009/10/29 18:02:59 UTC

[jira] Created: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
--------------------------------------------------------------------------------------

                 Key: LUCENE-2015
                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Cédrik LIME
            Priority: Minor


This patch adds a couple of non-ASCII chars to ISOLatin1AccentFilter (namely: left & right single quotation marks, en dash, em dash) which we very frequently encounter in our projects. I know that this class is now deprecated; this improvement is for legacy code that hasn't migrated yet.

It also enables easy access to the ASCII folding technique used in ASCIIFoldingFilter for potential re-use in non-Lucene-related code.
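
To make the first point concrete, here is a minimal sketch of folding the four characters named above. The class name and the chosen ASCII targets (apostrophe for both quotation marks, hyphen for both dashes) are assumptions for illustration; the attached patch may map them differently.

```java
// Hypothetical illustration, not code from the patch.
final class PunctuationFoldingSketch {

    /** Returns an ASCII stand-in for c, or c itself when no mapping applies. */
    static char fold(char c) {
        switch (c) {
            case '\u2018': // left single quotation mark
            case '\u2019': // right single quotation mark
                return '\'';
            case '\u2013': // en dash
            case '\u2014': // em dash
                return '-';
            default:
                return c;
        }
    }

    /** Folds every char of s; all mappings here are one-to-one, so the length is unchanged. */
    static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(fold(s.charAt(i)));
        }
        return sb.toString();
    }
}
```

Calling PunctuationFoldingSketch.fold on a string containing a right single quotation mark and an en dash returns the same string with an apostrophe and a hyphen in their place.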

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786710#action_12786710 ] 

Mark Miller commented on LUCENE-2015:
-------------------------------------

For this type of stuff "no one has complained" doesn't mean much - that's why these changes are so insidious - they are easy not to notice - docs just disappear, and users likely don't know they ever existed. For some apps this is absolutely disastrous.

We probably should have been more careful with LUCENE-1351, and we should be more careful in the future.

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch


[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Cédrik LIME (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771557#action_12771557 ] 

Cédrik LIME commented on LUCENE-2015:
-------------------------------------

Indeed, and that was my primary (internal) patch.
But then you lose the shared "output" buffer between incrementToken() calls, and you end up creating char[]'s like there is no tomorrow, which may be a performance regression.

What I can do is /add/ a static method that operates on a char[], for convenient external use.
What do you think?
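
The trade-off described here can be sketched as follows. This is a toy with a two-entry folding table, not the filter's code; the class and method names are hypothetical, and the worst-case expansion factor of 2 applies only to this toy table.

```java
// Hypothetical sketch: a reused instance buffer versus a static convenience method.
final class BufferReuseSketch {

    private char[] output = new char[16]; // shared between calls, grown on demand

    // Toy core: folds input[0..length) into out and returns the number of chars written.
    private static int foldInto(char[] input, int length, char[] out) {
        int n = 0;
        for (int i = 0; i < length; i++) {
            char c = input[i];
            if (c == '\u00E9') {          // e with acute
                out[n++] = 'e';
            } else if (c == '\u00C6') {   // AE ligature, expands to two chars
                out[n++] = 'A';
                out[n++] = 'E';
            } else {
                out[n++] = c;
            }
        }
        return n;
    }

    /** Instance path: allocates only when the shared buffer is too small. */
    int fold(char[] input, int length) {
        int worstCase = 2 * length;
        if (output.length < worstCase) {
            output = new char[worstCase];
        }
        return foldInto(input, length, output);
    }

    /** Exposes the shared buffer; only the first fold(...) result chars are valid. */
    char[] buffer() {
        return output;
    }

    /** Static convenience path: simple to call, but allocates on every call. */
    static char[] foldCopy(char[] input) {
        char[] out = new char[2 * input.length];
        int n = foldInto(input, input.length, out);
        return java.util.Arrays.copyOf(out, n);
    }
}
```

The static method suits occasional external use, while the instance path avoids per-token garbage inside a token stream.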

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch


[jira] Assigned: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-2015:
-----------------------------------

    Assignee: Robert Muir

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch, LUCENE-2015.patch


[jira] Updated: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Cédrik LIME (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cédrik LIME updated LUCENE-2015:
--------------------------------

    Attachment: ASCIIFoldingFilter-no_formatting.patch

As suggested by Robert, here is a new version of the ASCIIFoldingFilter patch which exposes the folding logic.
I have added 2 convenience methods that can operate on a char[] and on a CharSequence.

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch


[jira] Updated: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Cédrik LIME (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cédrik LIME updated LUCENE-2015:
--------------------------------

    Attachment: Filters.patch

(UTF-8 encoding)

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: Filters.patch


[jira] Updated: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Cédrik LIME (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cédrik LIME updated LUCENE-2015:
--------------------------------

    Attachment: LUCENE-2015.patch

Robert: I liked the dual approach (fold 1 {{char}} / a {{char[]}}) as it offered maximum flexibility: folding a String didn't incur a systematic copy of the input, as {{toCharArray()}} does, because I could use {{charAt()}} in a loop.
Nevertheless, I will be happy with a single method if this is your preferred approach.

I have updated your patch slightly to model the API after {{System.arraycopy()}}, which makes it a bit more flexible and easier to use:
* added offset for output
* shuffled the arguments order to mimic {{System.arraycopy()}}
* updated JavaDoc
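
For readers without the patch at hand, a signature modeled on System.arraycopy() could look like the sketch below. The class name, the two-entry folding table, and the choice of return value (the output index one past the last char written) are assumptions for illustration, not a copy of the attached patch.

```java
// Hypothetical sketch of an arraycopy-style folding method.
final class ArraycopyStyleFoldSketch {

    /**
     * Folds input[inputPos .. inputPos + length) into output, starting at outputPos.
     * Argument order mirrors System.arraycopy(src, srcPos, dest, destPos, length).
     * The caller must size output for the worst case (2x in this toy table).
     *
     * @return the output index one past the last char written (assumed convention)
     */
    static int fold(char[] input, int inputPos, char[] output, int outputPos, int length) {
        int end = inputPos + length;
        for (int pos = inputPos; pos < end; pos++) {
            char c = input[pos];
            switch (c) {
                case '\u00E9': // e with acute
                    output[outputPos++] = 'e';
                    break;
                case '\u00DF': // sharp s, expands to two chars
                    output[outputPos++] = 's';
                    output[outputPos++] = 's';
                    break;
                default:
                    output[outputPos++] = c;
            }
        }
        return outputPos;
    }
}
```

With an output offset, a caller can fold several fragments into one buffer by passing each call's return value as the next call's outputPos.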

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch, LUCENE-2015.patch, LUCENE-2015.patch


[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Cédrik LIME (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843539#action_12843539 ] 

Cédrik LIME commented on LUCENE-2015:
-------------------------------------

Robert, any news on this patch? Can we get it applied for Lucene 3.1?

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch


[jira] Resolved: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2015.
---------------------------------

    Resolution: Fixed

Committed revision 922277.

Thanks Cédrik!

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch, LUCENE-2015.patch, LUCENE-2015.patch


[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771524#action_12771524 ] 

Robert Muir commented on LUCENE-2015:
-------------------------------------

Cédrik, in my opinion, it would be easier to see the patch without the formatting changes if possible.

Even if there is bad indentation currently, I think this should be corrected in a separate patch.


> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: Filters.patch


[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771546#action_12771546 ] 

Robert Muir commented on LUCENE-2015:
-------------------------------------

Cédrik,

I think the idea of adding a public static method for folding is OK, but I think it should essentially do what foldToAscii does, not operate on a single 'char'.

We should avoid single 'char' parameters; instead it should work on the entire char[], I think?

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch


[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771592#action_12771592 ] 

Uwe Schindler commented on LUCENE-2015:
---------------------------------------

I would leave ISOLatin1AccentFilter as it is. No version logic for already deprecated classes: they are deprecated, so there is no support any more. Normally we would have removed it in 3.0; it is really only there to support old indexes, so no new features. If nobody has complained until now, we do not need to care. Maybe the modifications were so special that only some of the terms in such indexes were affected and nobody noticed the difference.

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch


[jira] Updated: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2015:
--------------------------------

    Attachment: LUCENE-2015.patch

Cédrik: I brought the patch up to date, but modified it slightly.

The reasoning is, I would prefer if we just expose one method in this case.


> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch, LUCENE-2015.patch


[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843593#action_12843593 ] 

Robert Muir commented on LUCENE-2015:
-------------------------------------

Thanks Cédrik, I like your latest change.

My primary reasoning for minimizing the API is because each exposed 
method has some cost to us (backwards compatibility).

I think if someone wants to fold a String they can still work with this API,
e.g. use a char[1] container, and not even bother if charAt() < 0x7F, etc.
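
A rough sketch of that suggestion: fold a String through an array-based method using a one-element input container, and skip the call entirely for chars below 0x7F. The fold(...) method here is a toy stand-in for the exposed API, and the class name, folding table and output size of 4 are assumptions.

```java
// Hypothetical sketch: String folding on top of an array-based folding method.
final class StringFoldSketch {

    // Toy stand-in for the array-based API; returns the output index after the last char written.
    static int fold(char[] input, int inputPos, char[] output, int outputPos, int length) {
        for (int pos = inputPos; pos < inputPos + length; pos++) {
            char c = input[pos];
            if (c == '\u00E9') {          // e with acute
                output[outputPos++] = 'e';
            } else if (c == '\u00DF') {   // sharp s
                output[outputPos++] = 's';
                output[outputPos++] = 's';
            } else {
                output[outputPos++] = c;
            }
        }
        return outputPos;
    }

    /** Folds s char by char without copying the whole input via toCharArray(). */
    static String foldString(CharSequence s) {
        char[] in = new char[1];  // one-char container, reused for every non-ASCII char
        char[] out = new char[4]; // room for the expansion of a single char (assumed bound)
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x7F) {       // plain ASCII: nothing to fold
                sb.append(c);
                continue;
            }
            in[0] = c;
            int n = fold(in, 0, out, 0, 1);
            sb.append(out, 0, n);
        }
        return sb.toString();
    }
}
```

Only two small arrays are allocated per call, regardless of the input length.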

In general I guess I am less concerned about this as the Lucene API 
doesn't use String.

I will commit in a day or two if no one objects.


> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch, LUCENE-2015.patch, LUCENE-2015.patch


[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771578#action_12771578 ] 

Robert Muir commented on LUCENE-2015:
-------------------------------------

bq. ISOLatin1AccentFilter was already modified in Lucene 2.4: see LUCENE-1351

That's interesting: if someone has a pre-2.4 Lucene index built with this filter, it is currently not compatible...
I guess no one has complained, but there could be some conditional logic based on Version to support those indexes...
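
Such conditional logic might look roughly like the sketch below. The enum, the class name and the example mappings are all hypothetical; this does not reproduce what LUCENE-1351 actually changed, and it does not use Lucene's own Version class.

```java
// Hypothetical sketch of version-dependent folding for index compatibility.
final class VersionedFoldSketch {

    /** Hypothetical stand-in for a compatibility version switch. */
    enum MatchVersion { BEFORE_MAPPING_CHANGE, AFTER_MAPPING_CHANGE }

    private final MatchVersion matchVersion;

    VersionedFoldSketch(MatchVersion matchVersion) {
        this.matchVersion = matchVersion;
    }

    char fold(char c) {
        if (c == '\u00E9') {
            return 'e'; // mapping assumed to exist in both versions
        }
        // Mapping assumed to have been added later: apply it only for new-style indexes,
        // so terms in old indexes still match at query time.
        if (c == '\u2013' && matchVersion == MatchVersion.AFTER_MAPPING_CHANGE) {
            return '-';
        }
        return c;
    }
}
```

An application would pass the version its index was built with, so query-time analysis produces the same terms as index-time analysis did.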

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch


[jira] Updated: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Cédrik LIME (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cédrik LIME updated LUCENE-2015:
--------------------------------

    Attachment: ASCIIFoldingFilter-no_formatting.patch
                ISOLatin1AccentFilter.patch

Here are the patches (UTF-8 encoding), 1 per filter.
I have removed the formatting on the switch(c) in ASCIIFoldingFilter for easier review.

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Priority: Minor
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch


[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771481#action_12771481 ] 

Robert Muir commented on LUCENE-2015:
-------------------------------------

Cédrik, is it possible to provide a patch without the formatting changes?

I am having trouble seeing the changes you made to ASCIIFoldingFilter.

By the way, I think ISOLatin1AccentFilter only stays around for backwards compatibility, to support old indexes. For that reason, in my opinion, we should not modify it.




[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Cédrik LIME (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771576#action_12771576 ] 

Cédrik LIME commented on LUCENE-2015:
-------------------------------------

Uwe,

ISOLatin1AccentFilter was already modified in Lucene 2.4: see LUCENE-1351

As for ASCIIFoldingFilter, I will take a second shot at an expert API next week. Stay tuned!




[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771593#action_12771593 ] 

Michael McCandless commented on LUCENE-2015:
--------------------------------------------

I think those changes to ISOLatin1AccentFilter predated our Version logic... I agree that had Version been around we probably should have used it.




[jira] Updated: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2015:
--------------------------------

    Fix Version/s: 3.1

> ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2015
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Cédrik LIME
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch, LUCENE-2015.patch
>
>
> This patch adds a couple of non-ascii chars to ISOLatin1AccentFilter (namely: left & right single quotation marks, en dash, em dash) which we very frequently encounter in our projects. I know that this class is now deprecated; this improvement is for legacy code that hasn't migrated yet.
> It also enables easy access to the ascii folding technique use in ASCIIFoldingFilter for potential re-use in non-Lucene-related code.



[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771568#action_12771568 ] 

Uwe Schindler commented on LUCENE-2015:
---------------------------------------

We cannot apply the patch to ISOLatin1AccentFilter, as it would break indexes already using it. That is why we migrated to ASCIIFoldingFilter and kept ISOLatin1AccentFilter alive. So we should leave it as it is.

On the buffer problem: for easy external use we could also provide an expert API that works like the current public foldToASCII method, which is memory efficient. We could additionally provide String/StringBuilder converters for external use. Internally it cannot get better than it currently is :-)
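
As a rough illustration of the String/StringBuilder converters mentioned above, here is a minimal sketch layered on a char[]-based primitive. The class name FoldingConverters, the method names, and the three mappings are assumptions made up for this example; this is not the Lucene API and covers only a tiny subset of what ASCIIFoldingFilter folds.

```java
// Hypothetical sketch, not Lucene code: convenience converters on top of a char[] primitive.
final class FoldingConverters {

    // Assumed per-char primitive: writes the folded form of c into out at pos
    // and returns the position just past the last char written.
    static int foldChar(char c, char[] out, int pos) {
        switch (c) {
            case '\u00E0': // U+00E0, a with grave
            case '\u00E2': // U+00E2, a with circumflex
                out[pos++] = 'a';
                break;
            case '\u00DF': // U+00DF, sharp s, expands to two chars
                out[pos++] = 's';
                out[pos++] = 's';
                break;
            default:       // everything else is copied unchanged
                out[pos++] = c;
        }
        return pos;
    }

    // StringBuilder converter: appends the folded form of in to sb and returns sb.
    static StringBuilder fold(CharSequence in, StringBuilder sb) {
        char[] scratch = new char[2]; // 2 = largest expansion in this sketch
        for (int i = 0; i < in.length(); i++) {
            int n = foldChar(in.charAt(i), scratch, 0);
            sb.append(scratch, 0, n);
        }
        return sb;
    }

    // String converter: the simplest entry point for non-Lucene callers.
    static String fold(String in) {
        return fold(in, new StringBuilder(in.length())).toString();
    }
}
```

The converters allocate per call, which is acceptable for occasional external use; the char[] primitive stays available for callers that want to manage their own buffers.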




[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Cédrik LIME (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771496#action_12771496 ] 

Cédrik LIME commented on LUCENE-2015:
-------------------------------------

Robert,

All I did is refactor the big switch(c) into its own method:
  public static final int foldToASCII(char c, char[] output, int outputPos)
and change the caller (public void foldToASCII(char[] input, int length)) accordingly.

I can submit a patch without formatting changes, but that means the source won't be nicely indented...
Please advise.

As for the ISOLatin1AccentFilter patch, it really is to enable us to remove a workaround for an issue we had with some special (yet frequent) chars. Feel free to ignore it should you think this part is not relevant.
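
For readers without the patch at hand, the refactoring described above might look roughly like the following sketch. It reuses the signature quoted in this comment, but the class name CharFoldSketch, the foldAll helper, the handful of mappings, and the assumption that the return value is the new output position are all illustrative guesses, not the contents of the attached patch.

```java
// Hypothetical sketch of the "switch in its own method" refactoring; not the actual patch.
final class CharFoldSketch {

    // Folds a single char into output, starting at outputPos.
    // Assumption: the return value is the position just past the last char written.
    public static int foldToASCII(char c, char[] output, int outputPos) {
        switch (c) {
            case '\u00E9': // U+00E9, e with acute
                output[outputPos++] = 'e';
                break;
            case '\u00C6': // U+00C6, AE ligature, expands to two chars
                output[outputPos++] = 'A';
                output[outputPos++] = 'E';
                break;
            case '\u2018': // left single quotation mark
            case '\u2019': // right single quotation mark
                output[outputPos++] = '\'';
                break;
            case '\u2013': // en dash
            case '\u2014': // em dash
                output[outputPos++] = '-';
                break;
            default:       // no mapping in this sketch: copy unchanged
                output[outputPos++] = c;
        }
        return outputPos;
    }

    // Caller loop, analogous in spirit to foldToASCII(char[] input, int length):
    // folds input[0..length) and returns the result as a String for easy inspection.
    static String foldAll(char[] input, int length) {
        char[] output = new char[2 * length]; // 2 = largest expansion in this sketch
        int outputPos = 0;
        for (int i = 0; i < length; i++) {
            outputPos = foldToASCII(input[i], output, outputPos);
        }
        return new String(output, 0, outputPos);
    }
}
```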




[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773014#action_12773014 ] 

Robert Muir commented on LUCENE-2015:
-------------------------------------

Cédrik, thanks!

At a glance this looks good to me... I can look at it more thoroughly later; I am heading out of town.




[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771564#action_12771564 ] 

Robert Muir commented on LUCENE-2015:
-------------------------------------

Cédrik, why would you create char[]'s like there is no tomorrow if you add a static method that operates on char[], for external use, but also use this within the incrementToken(), passing the tokenBuffer as an argument?
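
To make the point concrete, here is a minimal sketch of a static char[]-based method shared between external callers and a filter-like caller that reuses one output buffer across tokens. The class ReusableFolder, its methods, and the three mappings are hypothetical names for this example; the real filter works on Lucene's token buffer inside incrementToken(), which is not reproduced here.

```java
// Hypothetical sketch, not Lucene code: one static primitive, one reused output buffer.
final class ReusableFolder {

    private char[] output = new char[16]; // reused across calls, grown only when too small

    // Static primitive usable from anywhere; returns the position past the last char written.
    static int foldChar(char c, char[] out, int pos) {
        switch (c) {
            case '\u00FC': // U+00FC, u with diaeresis
                out[pos++] = 'u';
                break;
            case '\u00EF': // U+00EF, i with diaeresis
                out[pos++] = 'i';
                break;
            case '\u0152': // U+0152, OE ligature, expands to two chars
                out[pos++] = 'O';
                out[pos++] = 'E';
                break;
            default:       // no mapping in this sketch: copy unchanged
                out[pos++] = c;
        }
        return pos;
    }

    // Per-token entry point: folds input[0..length) into the reused buffer
    // and returns the folded length. Allocates only if the buffer must grow.
    int fold(char[] input, int length) {
        int needed = 2 * length; // 2 = largest expansion in this sketch
        if (output.length < needed) {
            output = new char[needed];
        }
        int pos = 0;
        for (int i = 0; i < length; i++) {
            pos = foldChar(input[i], output, pos);
        }
        return pos;
    }

    // Exposes the reused buffer; valid up to the length returned by the last fold call.
    char[] buffer() {
        return output;
    }
}
```

With this shape, a stream of short tokens triggers no allocation after the first call, while external code can still call the static primitive with its own arrays.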

