Posted to dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2008/11/23 17:47:44 UTC

[jira] Created: (LUCENE-1466) CharFilter - normalize characters before tokenizer

CharFilter - normalize characters before tokenizer
--------------------------------------------------

                 Key: LUCENE-1466
                 URL: https://issues.apache.org/jira/browse/LUCENE-1466
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Analysis
    Affects Versions: 2.4
            Reporter: Koji Sekiguchi
            Priority: Minor


This proposes importing CharFilter, which was introduced in Solr 1.4, into Lucene.

For details, please see:
SOLR-822
http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1466:
-----------------------------------

    Attachment: LUCENE-1466.patch

Updated patch attached.
- synced with trunk (smart Chinese analyzer (LUCENE-1629), etc.)
- added a useful idiom to get a CharStream and made the CharReader constructor private

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-1466
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1466
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch
>
>
> This proposes to import CharFilter that has been introduced in Solr 1.4.
> Please see for the details:
> - SOLR-822
> - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html


[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722435#action_12722435 ] 

Koji Sekiguchi commented on LUCENE-1466:
----------------------------------------

bq. Solr has already committed CharFilter, and had to work around it not being in Lucene with classes like CharStreamAwareTokenizer, etc. Koji, are you planning to work out a patch for Solr to switch to Lucene's impl?

Yeah, why not! :)


[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1466:
-----------------------------------

    Attachment: LUCENE-1466.patch

Added TestMappingCharFilter test case (copied from Solr).


[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722329#action_12722329 ] 

Michael McCandless commented on LUCENE-1466:
--------------------------------------------

OK, thanks Koji.  I'll add a bit more to the javadocs of BaseCharFilter about the performance caveats.

I plan to commit in a day or two.

Thanks for persisting, Koji!

Solr has already committed CharFilter, and had to work around it not being in Lucene with classes like CharStreamAwareTokenizer, etc.  Koji, are you planning to work out a patch for Solr to switch to Lucene's impl?


[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721761#action_12721761 ] 

Michael McCandless commented on LUCENE-1466:
--------------------------------------------

Thanks for the update Koji!

The patch looks good.  Some questions:

  * Can you add a CHANGES entry describing this new feature, as well
    as the change in type of Tokenizer.input?

  * Can we rename NormalizeMap -> NormalizeCharMap?

  * Could you add some javadocs to NormalizeCharMap,
    MappingCharFilter, BaseCharFilter?

  * The BaseCharFilter correct method looks spookily costly (has a for
    loop, going backwards for all added mappings).  It seems like in
    practice it should not be costly, because typically one corrects
    the offset only for the "current" token?  And, one could always
    build their own CharFilter (eg using arrays of ints or something)
    if they needed a more efficient mapping.

  * MappingCharFilter's match method is recursive.  But I think the
    depth of that recursion equals the length of the character sequence
    that's being mapped, right?  So the risk of stack overflow should be
    basically zero, unless someone does some insanely long character
    string mappings?
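
For readers following the offset-correction point above, here is a minimal sketch of the idea, not the code from the patch. It assumes the filter records, for each replacement, the offset in the filtered text and the cumulative difference from the original text; the class name OffsetCorrectionSketch and the record() method are hypothetical. The backward scan is cheap when offsets are corrected for the most recent token, as the comment suggests.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for the offset bookkeeping discussed above.
 * This is NOT the BaseCharFilter from the patch; names and layout are assumptions.
 */
class OffsetCorrectionSketch {
    // Each entry is {filteredOffset, cumulativeDiff}: from filteredOffset onward,
    // originalOffset = filteredOffset + cumulativeDiff. Entries are added in
    // increasing filteredOffset order as the filter consumes its input.
    private final List<int[]> corrections = new ArrayList<>();

    void record(int filteredOffset, int cumulativeDiff) {
        corrections.add(new int[] {filteredOffset, cumulativeDiff});
    }

    /** Maps an offset in the filtered text back to an offset in the original text. */
    int correct(int filteredOffset) {
        // Scan backwards: a tokenizer working on the "current" token usually
        // matches one of the most recently recorded entries.
        for (int i = corrections.size() - 1; i >= 0; i--) {
            int[] entry = corrections.get(i);
            if (filteredOffset >= entry[0]) {
                return filteredOffset + entry[1];
            }
        }
        return filteredOffset; // before any replacement, offsets are unchanged
    }

    public static void main(String[] args) {
        // Original "aaa bbb ccc" filtered to "a bbb c" ("aaa" -> "a", "ccc" -> "c").
        OffsetCorrectionSketch sketch = new OffsetCorrectionSketch();
        sketch.record(1, 2); // after "a", two original chars were dropped
        sketch.record(7, 4); // after "c", four original chars were dropped in total
        System.out.println(sketch.correct(2)); // start of "bbb" in the original
        System.out.println(sketch.correct(6)); // start of "ccc" in the original
    }
}
```

A filter with very many replacements per document could swap the linear scan for a binary search over the same list, which is the kind of custom CharFilter the comment alludes to.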


I have some back-compat concerns. First, I see these 2 failures in
"ant test-tag":

{code}
[junit] Testcase: testExclusiveLowerNull(org.apache.lucene.search.TestRangeQuery):	Caused an ERROR
[junit] input
[junit] java.lang.NoSuchFieldError: input
[junit] 	at org.apache.lucene.search.TestRangeQuery$SingleCharAnalyzer$SingleCharTokenizer.incrementToken(TestRangeQuery.java:247)
[junit] 	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:160)
[junit] 	at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
[junit] 	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
[junit] 	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
[junit] 	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
[junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2354)
[junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2328)
[junit] 	at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:306)
[junit] 	at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:289)
[junit] 	at org.apache.lucene.search.TestRangeQuery.testExclusiveLowerNull(TestRangeQuery.java:317)
[junit] 
[junit] 
[junit] Testcase: testInclusiveLowerNull(org.apache.lucene.search.TestRangeQuery):	Caused an ERROR
[junit] input
[junit] java.lang.NoSuchFieldError: input
[junit] 	at org.apache.lucene.search.TestRangeQuery$SingleCharAnalyzer$SingleCharTokenizer.incrementToken(TestRangeQuery.java:247)
[junit] 	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:160)
[junit] 	at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
[junit] 	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
[junit] 	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
[junit] 	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
[junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2354)
[junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2328)
[junit] 	at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:306)
[junit] 	at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:289)
[junit] 	at org.apache.lucene.search.TestRangeQuery.testInclusiveLowerNull(TestRangeQuery.java:351)
{code}

These are JAR drop-in compatibility failures, because the type of
Tokenizer.input changed from Reader to CharStream.  Since CharStream
subclasses Reader, references to Tokenizer.input would be fixed with a
simple recompile.

However, assignments to "input" by external subclasses of Tokenizer
will result in a compilation error.  You have to replace such
assignments with {{this.input = CharReader.get(input)}}.  Since input
is protected, any subclass can assign to it.  The good news is
this would be a catastrophic compilation error (vs. something silent at
runtime); the bad news is that it's [unfortunately] against our
back-compat policies.
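
The migration described above can be sketched with stand-in classes. Everything below is hypothetical scaffolding that only mirrors the shapes named in this thread (a CharStream that subclasses Reader, a CharReader with a private constructor and a static get, and a Tokenizer with a protected input field); it does not use Lucene itself, and the pass-through behaviour of get() for inputs that are already a CharStream is an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Stand-in for CharStream: a Reader that can also correct offsets. */
abstract class SketchCharStream extends Reader {
    abstract int correctOffset(int currentOff);
}

/** Stand-in for CharReader: wraps a plain Reader; the constructor is private. */
final class SketchCharReader extends SketchCharStream {
    private final Reader in;

    private SketchCharReader(Reader in) {
        this.in = in;
    }

    /** Assumed idiom: reuse the input if it is already a char stream, else wrap it. */
    static SketchCharStream get(Reader in) {
        return in instanceof SketchCharStream ? (SketchCharStream) in : new SketchCharReader(in);
    }

    @Override
    int correctOffset(int currentOff) {
        return currentOff; // a plain wrapper changes nothing, so offsets pass through
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return in.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

/** Stand-in for Tokenizer after the change: input is a char stream, not a Reader. */
abstract class SketchTokenizer {
    protected SketchCharStream input;
}

/** An "external subclass" that assigns to input directly. */
class SingleCharTokenizerSketch extends SketchTokenizer {
    SingleCharTokenizerSketch(Reader reader) {
        // this.input = reader;  // no longer compiles: Reader is not a SketchCharStream
        this.input = SketchCharReader.get(reader);
    }

    /** Returns the next character as a token, or null at end of input. */
    String next() {
        try {
            int c = input.read();
            return c < 0 ? null : String.valueOf((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SingleCharTokenizerSketch t = new SingleCharTokenizerSketch(new StringReader("ab"));
        System.out.println(t.next());
        System.out.println(t.next());
    }
}
```

Code that only reads from input keeps compiling because the stream is still a Reader; only direct assignments need the wrapper call.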

Any ideas how we can fix this to "migrate" to CharStream without
breaking back compat?



[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1466:
-----------------------------------

    Attachment: LUCENE-1466.patch

Renamed correctPosition() to correct() because it corrects token offsets, not token positions.


[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718291#action_12718291 ] 

Robert Muir commented on LUCENE-1466:
-------------------------------------

Just as an alternative, I have a different mechanism as part of the LUCENE-1488 patch I am working on. But maybe it's good to have options, since my approach depends on the ICU library.

Below is an excerpt from the javadoc.

A TokenFilter that transforms text with ICU.

ICU provides text-transformation functionality via its Transliteration API.
Although script conversion is its most common use, a transliterator can actually perform a more general class of tasks. 
...
Some useful transformations for search are built-in:
* Conversion from Traditional to Simplified Chinese characters
* Conversion from Hiragana to Katakana
* Conversion from Fullwidth to Halfwidth forms.
...
Example usage:
  stream = new ICUTransformFilter(stream, Transliterator.getInstance("Traditional-Simplified"));



[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718579#action_12718579 ] 

Michael McCandless commented on LUCENE-1466:
--------------------------------------------

I'd like to get this in for 2.9.


[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718539#action_12718539 ] 

Koji Sekiguchi commented on LUCENE-1466:
----------------------------------------

If I can vote for it, I want it to be part of 2.9. I know of several (at least five) companies that use this feature in production. We use it via Solr (SOLR-822), but we hope it becomes part of Lucene core.


[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718277#action_12718277 ] 

Mark Miller commented on LUCENE-1466:
-------------------------------------

Anyone want to step up for this one or should we push it off to 3.0?


[jira] Issue Comment Edited: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721588#action_12721588 ] 

Koji Sekiguchi edited comment on LUCENE-1466 at 6/18/09 7:04 PM:
-----------------------------------------------------------------

Updated patch attached.
- synced with trunk (smart Chinese analyzer (LUCENE-1629), etc.)
- added a useful idiom to get a CharStream and made the CharReader constructor private

      was (Author: koji):
    updated patch attached.
- sync trunk (smart chinese analyzer(LUCENE-1629), etc.)
- added a useful idiom to get ChatStream and make private CharReader constructor
  

[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722283#action_12722283 ] 

Koji Sekiguchi commented on LUCENE-1466:
----------------------------------------

Oops. Thanks for the updated patch, Mike!
{quote}
    *  Can you add a CHANGES entry describing this new feature, as well
      as the change in type of Tokenizer.input?
    * Can we rename NormalizeMap -> NormalizeCharMap?
    * Could you add some javadocs to NormalizeCharMap,
      MappingCharFilter, BaseCharFilter?
{quote}
Your patch looks good!
{quote}
    * The BaseCharFilter correct method looks spookily costly (has a for
      loop, going backwards for all added mappings). It seems like in
      practice it should not be costly, because typically one corrects
      the offset only for the "current" token? And, one could always
      build their own CharFilter (eg using arrays of ints or something)
      if they needed a more efficient mapping.
{quote}
Yes, users can create their own CharFilter if they needed a more efficient mapping.
{quote}
    * MappingCharFilter's match method is recursive. But I think the
      depth of that recursion equals the length of the character sequence
      that's being mapped, right? So the risk of stack overflow should be
      basically zero, unless someone does some insanely long character
      string mappings?
{quote}
You are correct.
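
To illustrate why the recursion depth is bounded by the mapping length, here is a minimal sketch of a recursive longest match over a character trie. It is not the MappingCharFilter code from the patch; the class CharMapSketch, its fields, and the normalize() helper are hypothetical and only show the shape of the argument.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical character trie; each node is one character further into a mapping key. */
class CharMapSketch {
    private final Map<Character, CharMapSketch> children = new HashMap<>();
    private String replacement; // non-null when a mapping key ends at this node
    private int depth;          // number of characters from the root to this node

    void add(String match, String replacement) {
        if (match.isEmpty()) {
            throw new IllegalArgumentException("match must not be empty");
        }
        CharMapSketch node = this;
        for (int i = 0; i < match.length(); i++) {
            char c = match.charAt(i);
            CharMapSketch next = node.children.get(c);
            if (next == null) {
                next = new CharMapSketch();
                next.depth = node.depth + 1;
                node.children.put(c, next);
            }
            node = next;
        }
        node.replacement = replacement;
    }

    /**
     * Recursive longest match starting at pos. Each call descends one trie
     * level, so the recursion depth never exceeds the longest mapping key.
     */
    static CharMapSketch match(CharMapSketch node, CharSequence text, int pos) {
        CharMapSketch best = node.replacement != null ? node : null;
        if (pos < text.length()) {
            CharMapSketch child = node.children.get(text.charAt(pos));
            if (child != null) {
                CharMapSketch deeper = match(child, text, pos + 1);
                if (deeper != null) {
                    best = deeper; // prefer the longer match
                }
            }
        }
        return best;
    }

    /** Applies the mappings left to right, copying unmatched characters as-is. */
    static String normalize(CharMapSketch root, CharSequence text) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            CharMapSketch hit = match(root, text, pos);
            if (hit == null) {
                out.append(text.charAt(pos));
                pos++;
            } else {
                out.append(hit.replacement);
                pos += hit.depth; // skip the characters the mapping consumed
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        CharMapSketch root = new CharMapSketch();
        root.add("aa", "a");
        root.add("aaa", "b");
        System.out.println(normalize(root, "xaay"));
        System.out.println(normalize(root, "aaaa"));
    }
}
```

With mapping keys of at most three characters, as in the usage above, match() never recurses more than three levels deep regardless of input length.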

{quote}
I think we should make an exception to back-compat here, and simply
change Tokenizer.input from Reader to CharStream (subclasses
Reader). Properly respecting back-compat will be a lot of work, and,
if external subclasses are directly assigning to input, they really
ought to be using reset(Reader) instead.
{quote}
I agree with you, Mike.


[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1466:
---------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
    Fix Version/s: 2.9


[jira] Assigned: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1466:
------------------------------------------

    Assignee: Michael McCandless


[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1466:
---------------------------------------

    Attachment: LUCENE-1466-back.patch
                LUCENE-1466.patch

I think we should make an exception to back-compat here, and simply
change Tokenizer.input from Reader to CharStream (subclasses
Reader).  Properly respecting back-compat will be a lot of work, and,
if external subclasses are directly assigning to input, they really
ought to be using reset(Reader) instead.

I updated the patch to address the above issues, fixed some whitespace
issues, added Tokenizer.reset(CharStream) and patched back-compat.



[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1466:
-----------------------------------

    Attachment: LUCENE-1466-TestCharFilter.patch

An additional test for CharFilter that I forgot to move from Solr... Mike, can you commit this? Thank you. :)

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-1466
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1466
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1466-back.patch, LUCENE-1466-TestCharFilter.patch, LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch
>
>
> This proposes to import CharFilter that has been introduced in Solr 1.4.
> Please see for the details:
> - SOLR-822
> - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html


[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723500#action_12723500 ] 

Michael McCandless commented on LUCENE-1466:
--------------------------------------------

> an additional test for CharFilter that I forgot to move from Solr... Mike, can you commit this? Thank you.

Done... thanks!



[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723374#action_12723374 ] 

Michael McCandless commented on LUCENE-1466:
--------------------------------------------

Whoops, sorry, I did indeed see your new patch and applied it, but then failed to svn add.  OK, I just committed them.  Thanks!

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-1466
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1466
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1466-back.patch, LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch
>
>
> This proposes to import CharFilter that has been introduced in Solr 1.4.
> Please see for the details:
> - SOLR-822
> - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html


[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723334#action_12723334 ] 

Koji Sekiguchi commented on LUCENE-1466:
----------------------------------------

Thank you, Mike, for committing this! I'll open a ticket for Solr soon. BTW, I cannot see TestMappingCharFilter, which is in the latest patch I attached. Is there a problem with the test, or did it just slip through?



[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1466:
-----------------------------------

      Description: 
This proposes to import CharFilter that has been introduced in Solr 1.4.

Please see for the details:
- SOLR-822
- http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

  was:
This proposes to import CharFilter that has been introduced in Solr 1.4.

Please see for the details:
SOLR-822
http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

    Lucene Fields: [New, Patch Available]  (was: [New])

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-1466
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1466
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LUCENE-1466.patch
>
>
> This proposes to import CharFilter that has been introduced in Solr 1.4.
> Please see for the details:
> - SOLR-822
> - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html


[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1466:
-----------------------------------

    Attachment: LUCENE-1466.patch

Patch attached.

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-1466
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1466
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LUCENE-1466.patch
>
>
> This proposes to import CharFilter that has been introduced in Solr 1.4.
> Please see for the details:
> SOLR-822
> http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html


[jira] Resolved: (LUCENE-1466) CharFilter - normalize characters before tokenizer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1466.
----------------------------------------

    Resolution: Fixed

OK, I just committed this.  Thanks, Koji!  Can you open a Solr issue and work out a patch so Solr can cut over to this?  Thanks.

