Posted to solr-dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2009/12/13 17:53:18 UTC

[jira] Created: (SOLR-1653) add PatternReplaceCharFilter

add PatternReplaceCharFilter
----------------------------

                 Key: SOLR-1653
                 URL: https://issues.apache.org/jira/browse/SOLR-1653
             Project: Solr
          Issue Type: New Feature
          Components: Schema and Analysis
    Affects Versions: 1.4
            Reporter: Koji Sekiguchi
            Priority: Minor
             Fix For: 1.5


Add a new CharFilter that uses a regular expression to select the text to replace in the char stream.

Usage:
{code:title=schema.xml}
<fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                groupedPattern="([nN][oO]\.)\s*(\d+)"
                replaceGroups="1,2" blockDelimiters=":;"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
{code}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798271#action_12798271 ] 

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

Thanks, Paul! I've just committed revision 897357.

> add PatternReplaceCharFilter
> ----------------------------
>
>                 Key: SOLR-1653
>                 URL: https://issues.apache.org/jira/browse/SOLR-1653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1653.patch, SOLR-1653.patch
>
>
> Add a new CharFilter that uses a regular expression to select the text to replace in the char stream.
> Usage:
> {code:title=schema.xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 groupedPattern="([nN][oO]\.)\s*(\d+)"
>                 replaceGroups="1,2" blockDelimiters=":;"/>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> {code}



[jira] Issue Comment Edited: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790056#action_12790056 ] 

Koji Sekiguchi edited comment on SOLR-1653 at 12/14/09 9:30 AM:
----------------------------------------------------------------

Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|remove "ing" from the end of word|
|see-ing looking|(\w+)ing|1|see-ing look|same as above. 2nd parentheses can be omitted|
|No.1 NO. no.  543|[nN][oO]\.\s*(\d+)|{#},1|#1	NO.	#543|sample for literal. do not forget to set blockDelimiters other than period when you use period in groupedPattern|
|abc=1234=5678|(\w+)=(\d+)=(\d+)|3,{=},1,{=},2|5678=abc=1234|change the order of the groups|


      was (Author: koji):
    Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|remove "ing" from the end of word|
|see-ing looking|(\w+)ing|1|see-ing look|same as above. 2nd parentheses can be omitted|
|No.1 NO. no.  543|[nN][oO]\.\s*(\d+)|{#},1|#1	NO.	#543|sample for literal. do not forget to set blockDelimiters other than period when you use period in groupedPattern|
|abc-1234-5678|(\w+)=(\d+)=(\d+)|3,{=},1,{=},2|5678=abc=1234|change the order of the groups|

  


[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790067#action_12790067 ] 

Noble Paul commented on SOLR-1653:
----------------------------------

I guess this can be achieved with the matcher#replaceAll() directly 

input = see-ing looking
regex = (\w+)(ing)
replaceWith = $1

input = abc=1234=5678
regex = (\w+)=(\d+)=(\d+)
replaceWith = $3=$1=$2
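
As a minimal sketch of the two replacements above, the following uses only java.util.regex. The class and method names (ReplaceAllSketch, replace) are made up for illustration and are not part of any patch on this issue; Matcher#replaceAll() and the $n group references are standard Java.

```java
import java.util.regex.Pattern;

public class ReplaceAllSketch {
    // Replaces every match of regex in input, using $n group references.
    static String replace(String regex, String replaceWith, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll(replaceWith);
    }

    public static void main(String[] args) {
        // Drops a trailing "ing": prints "see-ing look"
        System.out.println(replace("(\\w+)(ing)", "$1", "see-ing looking"));
        // Reorders the groups: prints "5678=abc=1234"
        System.out.println(replace("(\\w+)=(\\d+)=(\\d+)", "$3=$1=$2", "abc=1234=5678"));
    }
}
```

Note that "see-ing" is left alone because "-" is not a word character, so (\w+) cannot match anything directly in front of that "ing".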





[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790577#action_12790577 ] 

Shalin Shekhar Mangar commented on SOLR-1653:
---------------------------------------------

bq. If there are no objections, I'll commit later today.

+1

Thanks Koji!



[jira] Resolved: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved SOLR-1653.
----------------------------------

    Resolution: Fixed

Committed revision 890798. Thanks Shalin and Noble for taking time to review the patch.



[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790056#action_12790056 ] 

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|remove "ing" from the end of word|
|see-ing looking|(\w+)ing|1|see-ing look|same as above. 2nd parentheses can be omitted|
|No.1 NO. no.  543|[nN][oO]\.\s*(\d+)|{#},1|#1	NO.	#543|sample for literal. do not forget to set blockDelimiters other than period when you use period in groupedPattern|
|abc-1234-5678|(\w+)-(\d+)-(\d+)|3,{-},1,{-},2|5678-abc-1234|change the order of the groups|
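
One way to read the replaceGroups column is as a comma-separated list in which a bare number refers to a capture group and {text} is a literal. Under that assumption (inferred only from the rows above, not from the patch itself), each value maps onto a standard java.util.regex replacement string. The sketch below is hypothetical: the class and method names are invented, it is not the patch's parser, and it does not handle literals that contain commas.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceGroupsSketch {
    // Converts e.g. "3,{-},1,{-},2" into the standard replacement "$3-$1-$2".
    // Assumed syntax: bare numbers are group references, {text} is a literal.
    static String toReplacement(String replaceGroups) {
        StringBuilder sb = new StringBuilder();
        for (String token : replaceGroups.split(",")) {
            token = token.trim();
            if (token.startsWith("{") && token.endsWith("}")) {
                // Escape '$' and '\' so the literal is copied verbatim.
                String literal = token.substring(1, token.length() - 1);
                sb.append(Matcher.quoteReplacement(literal));
            } else {
                sb.append('$').append(Integer.parseInt(token));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String replacement = toReplacement("3,{-},1,{-},2");
        System.out.println(replacement); // prints "$3-$1-$2"
        System.out.println(Pattern.compile("(\\w+)-(\\d+)-(\\d+)")
                .matcher("abc-1234-5678").replaceAll(replacement)); // prints "5678-abc-1234"
    }
}
```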




[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790572#action_12790572 ] 

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

I see that the existing "PatternReplaceFilter" (not a CharFilter) uses "pattern", but it uses "replacement" rather than "replaceWith". I think I'll use "pattern" and "replacement".



[jira] Issue Comment Edited: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790056#action_12790056 ] 

Koji Sekiguchi edited comment on SOLR-1653 at 12/14/09 9:28 AM:
----------------------------------------------------------------

Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|remove "ing" from the end of word|
|see-ing looking|(\w+)ing|1|see-ing look|same as above. 2nd parentheses can be omitted|
|No.1 NO. no.  543|[nN][oO]\.\s*(\d+)|{#},1|#1	NO.	#543|sample for literal. do not forget to set blockDelimiters other than period when you use period in groupedPattern|
|abc-1234-5678|(\w+)=(\d+)=(\d+)|3,{=},1,{=},2|5678-abc-1234|change the order of the groups|


      was (Author: koji):
    Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|remove "ing" from the end of word|
|see-ing looking|(\w+)ing|1|see-ing look|same as above. 2nd parentheses can be omitted|
|No.1 NO. no.  543|[nN][oO]\.\s*(\d+)|{#},1|#1	NO.	#543|sample for literal. do not forget to set blockDelimiters other than period when you use period in groupedPattern|
|abc-1234-5678|(\w+)--(\d+)--(\d+)|3,{--},1,{--},2|5678-abc-1234|change the order of the groups|

  


[jira] Issue Comment Edited: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790056#action_12790056 ] 

Koji Sekiguchi edited comment on SOLR-1653 at 12/14/09 9:29 AM:
----------------------------------------------------------------

Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|remove "ing" from the end of word|
|see-ing looking|(\w+)ing|1|see-ing look|same as above. 2nd parentheses can be omitted|
|No.1 NO. no.  543|[nN][oO]\.\s*(\d+)|{#},1|#1	NO.	#543|sample for literal. do not forget to set blockDelimiters other than period when you use period in groupedPattern|
|abc-1234-5678|(\w+)=(\d+)=(\d+)|3,{=},1,{=},2|5678=abc=1234|change the order of the groups|


      was (Author: koji):
    Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|remove "ing" from the end of word|
|see-ing looking|(\w+)ing|1|see-ing look|same as above. 2nd parentheses can be omitted|
|No.1 NO. no.  543|[nN][oO]\.\s*(\d+)|{#},1|#1	NO.	#543|sample for literal. do not forget to set blockDelimiters other than period when you use period in groupedPattern|
|abc-1234-5678|(\w+)=(\d+)=(\d+)|3,{=},1,{=},2|5678-abc-1234|change the order of the groups|

  


[jira] Updated: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-1653:
---------------------------------

    Attachment: SOLR-1653.patch



[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790129#action_12790129 ] 

Noble Paul commented on SOLR-1653:
----------------------------------

bq. I need to process one match at a time.

I guess regex can process one match at a time. 

The most important point is that we don't need to educate users on this new syntax (I am still not clear about the syntax), and there is no need to write and maintain any parsing code.



[jira] Issue Comment Edited: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790056#action_12790056 ] 

Koji Sekiguchi edited comment on SOLR-1653 at 12/14/09 9:27 AM:
----------------------------------------------------------------

Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|remove "ing" from the end of word|
|see-ing looking|(\w+)ing|1|see-ing look|same as above. 2nd parentheses can be omitted|
|No.1 NO. no.  543|[nN][oO]\.\s*(\d+)|{#},1|#1	NO.	#543|sample for literal. do not forget to set blockDelimiters other than period when you use period in groupedPattern|
|abc-1234-5678|(\w+)--(\d+)--(\d+)|3,{--},1,{--},2|5678-abc-1234|change the order of the groups|


      was (Author: koji):
    Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|remove "ing" from the end of word|
|see-ing looking|(\w+)ing|1|see-ing look|same as above. 2nd parentheses can be omitted|
|No.1 NO. no.  543|[nN][oO]\.\s*(\d+)|{#},1|#1	NO.	#543|sample for literal. do not forget to set blockDelimiters other than period when you use period in groupedPattern|
|abc-1234-5678|(\w+)-(\d+)-(\d+)|3,{-},1,{-},2|5678-abc-1234|change the order of the groups|

  


[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790565#action_12790565 ] 

Noble Paul commented on SOLR-1653:
----------------------------------

In Solr we refer to regular expression strings as 'regex'. If you think 'pattern' is OK, go ahead.



[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Paul taylor (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797601#action_12797601 ] 

Paul taylor commented on SOLR-1653:
-----------------------------------

Hi, I'm using this outside Solr in an analyser, and I think there may be a performance issue because you cannot pass in a compiled Pattern. In the reusableTokenStream() method you cannot reset a CharFilter the way you can a Tokenizer, so it has to recompile the pattern every time.

i.e. 
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
        if (streams == null) {
            // First call on this thread: build the whole chain.
            streams = new SavedStreams();
            setPreviousTokenStream(streams);
            streams.tokenStream = new StandardTokenizer(Version.LUCENE_CURRENT,
                    new PatternReplaceCharFilter("(no\\.) ([0-9]+)", "$1$2", reader));
            streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
            streams.filteredTokenStream = new AccentFilter(streams.filteredTokenStream);
            streams.filteredTokenStream = new LowerCaseFilter(streams.filteredTokenStream);
        }
        else {
            // Later calls: the tokenizer is reused, but a new char filter
            // (and so a newly compiled pattern) is created every time.
            streams.tokenStream.reset(new PatternReplaceCharFilter("(no\\.) ([0-9]+)", "$1$2", reader));
        }
        return streams.filteredTokenStream;
    }
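
The cost described above comes from compiling the same regular expression on every call. As a hedged illustration using only java.util.regex (no Lucene or Solr classes, and not taken from any patch on this issue), the pattern can be compiled once and shared, with only a Matcher created per input. The class and method names below are hypothetical.

```java
import java.util.regex.Pattern;

public class PrecompiledPatternSketch {
    // Compiled once; java.util.regex.Pattern is immutable and safe to share
    // between threads, so it can live in a static final field.
    private static final Pattern NO_PATTERN = Pattern.compile("(no\\.) ([0-9]+)");

    // Creating a Matcher per input is cheap compared with recompiling the pattern.
    static String normalize(CharSequence text) {
        return NO_PATTERN.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(normalize("no. 1 and no. 22")); // prints "no.1 and no.22"
    }
}
```

A char filter that accepted a compiled Pattern in its constructor could be fed from a field like NO_PATTERN in the same way.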



[jira] Updated: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-1653:
---------------------------------

    Attachment: SOLR-1653.patch

Sorry about that. I introduced my own syntax because, when I started the first patch, I was trying to correct offsets per group within a match. Now that I've implemented offset correction per match, I can use the standard syntax. Here is the new patch.

Usage:
{code:title=schema.xml}
<fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([nN][oO]\.)\s*(\d+)"
                replaceWith="$1$2"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
{code}

If there are no objections, I'll commit later today.

> add PatternReplaceCharFilter
> ----------------------------
>
>                 Key: SOLR-1653
>                 URL: https://issues.apache.org/jira/browse/SOLR-1653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1653.patch, SOLR-1653.patch
>
>
> Add a new CharFilter that uses a regular expression for the target of replace string in char stream.
> Usage:
> {code:title=schema.xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 groupedPattern="([nN][oO]\.)\s*(\d+)"
>                 replaceGroups="1,2" blockDelimiters=":;"/>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> {code}



[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790127#action_12790127 ] 

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

bq. I guess this can be achieved with the matcher#replaceAll() directly 

You're right, if we don't need to correct the offsets of the output char stream. To correct them, I need to process one match at a time.
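
To make the point concrete, here is a rough sketch of why replaceAll() is not enough: it returns only the final string, so the position of each match is lost. Walking the matches one at a time with find() and appendReplacement() lets us record, after each replacement, how far the output has drifted from the input. This is only an illustration under my own assumptions, not the code in the patch; the class name, fields and correctOffset() method here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetCorrectionSketch {
    public final String output;
    // Parallel lists: for output offsets >= outOffsets.get(i),
    // add diffs.get(i) to get the offset in the original input.
    private final List<Integer> outOffsets = new ArrayList<>();
    private final List<Integer> diffs = new ArrayList<>();

    public OffsetCorrectionSketch(String input, Pattern pattern, String replaceWith) {
        Matcher m = pattern.matcher(input);
        StringBuffer sb = new StringBuffer();
        int cumulativeDiff = 0; // input offset minus output offset so far
        while (m.find()) {
            // Copies the text before the match, then the expanded replacement.
            m.appendReplacement(sb, replaceWith);
            int newDiff = m.end() - sb.length();
            if (newDiff != cumulativeDiff) {
                outOffsets.add(sb.length());
                diffs.add(newDiff);
                cumulativeDiff = newDiff;
            }
        }
        m.appendTail(sb);
        output = sb.toString();
    }

    // Maps an offset in the output back to an offset in the input.
    // Simplification: offsets inside a replaced region still use the
    // previous correction, so they are only approximate.
    public int correctOffset(int outOffset) {
        int diff = 0;
        for (int i = 0; i < outOffsets.size(); i++) {
            if (outOffsets.get(i) > outOffset) break;
            diff = diffs.get(i);
        }
        return outOffset + diff;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("([nN][oO]\\.)\\s*(\\d+)");
        OffsetCorrectionSketch s =
            new OffsetCorrectionSketch("see No. 12 and no.  7", p, "$1$2");
        System.out.println(s.output);            // prints: see No.12 and no.7
        System.out.println(s.correctOffset(10)); // prints: 11 ('a' of "and" in the input)
    }
}
```

The correction is recorded once per match, not per group, which is why the standard $1$2 replacement syntax can be used.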

> add PatternReplaceCharFilter
> ----------------------------
>
>                 Key: SOLR-1653
>                 URL: https://issues.apache.org/jira/browse/SOLR-1653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1653.patch
>
>
> Add a new CharFilter that uses a regular expression for the target of replace string in char stream.
> Usage:
> {code:title=schema.xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 groupedPattern="([nN][oO]\.)\s*(\d+)"
>                 replaceGroups="1,2" blockDelimiters=":;"/>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> {code}



[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789957#action_12789957 ] 

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

I'll commit in a few days.

> add PatternReplaceCharFilter
> ----------------------------
>
>                 Key: SOLR-1653
>                 URL: https://issues.apache.org/jira/browse/SOLR-1653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1653.patch
>
>
> Add a new CharFilter that uses a regular expression for the target of replace string in char stream.
> Usage:
> {code:title=schema.xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 groupedPattern="([nN][oO]\.)\s*(\d+)"
>                 replaceGroups="1,2" blockDelimiters=":;"/>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> {code}



[jira] Assigned: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned SOLR-1653:
------------------------------------

    Assignee: Koji Sekiguchi

> add PatternReplaceCharFilter
> ----------------------------
>
>                 Key: SOLR-1653
>                 URL: https://issues.apache.org/jira/browse/SOLR-1653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1653.patch
>
>
> Add a new CharFilter that uses a regular expression for the target of replace string in char stream.
> Usage:
> {code:title=schema.xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 groupedPattern="([nN][oO]\.)\s*(\d+)"
>                 replaceGroups="1,2" blockDelimiters=":;"/>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> {code}



[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790026#action_12790026 ] 

Shalin Shekhar Mangar commented on SOLR-1653:
---------------------------------------------

Koji, even after reading through the test, I do not understand how to use it. Are the characters in curly braces written down for non-groups only? What if I want to remove one particular group?

It is always good to write a use-case and an example in the issue description itself.

> add PatternReplaceCharFilter
> ----------------------------
>
>                 Key: SOLR-1653
>                 URL: https://issues.apache.org/jira/browse/SOLR-1653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1653.patch
>
>
> Add a new CharFilter that uses a regular expression for the target of replace string in char stream.
> Usage:
> {code:title=schema.xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 groupedPattern="([nN][oO]\.)\s*(\d+)"
>                 replaceGroups="1,2" blockDelimiters=":;"/>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> {code}
