Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/08/19 15:50:16 UTC

[jira] Created: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.

Allow customizing how WordDelimiterFilter tokenizes text.
---------------------------------------------------------

                 Key: SOLR-2059
                 URL: https://issues.apache.org/jira/browse/SOLR-2059
             Project: Solr
          Issue Type: New Feature
          Components: Schema and Analysis
            Reporter: Robert Muir
            Priority: Minor
             Fix For: 3.1, 4.0
         Attachments: SOLR-2059.patch

By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode Properties).
Based on these types and the options provided, it splits and concatenates text.

In some circumstances, you might need to tweak this behavior.
It seems the filter already had this in mind, since you can pass in a custom byte[] type table.
But it's not exposed in the factory.

I think you should be able to customize the defaults with a configuration file:
{noformat}
# A customized type mapping for WordDelimiterFilterFactory
# the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
# 
# the default for any character without a mapping is always computed from 
# Unicode character properties

# Map the $, %, '.', and ',' characters to DIGIT 
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\u002C => DIGIT
{noformat}
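For reference, the mapping file would presumably be wired in via a factory attribute in schema.xml; a hypothetical sketch (the attribute name {{types}} and the filename {{wdfftypes.txt}} are illustrative):
{noformat}
<fieldType name="text_wd" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            types="wdfftypes.txt"
            generateWordParts="1" generateNumberParts="1"/>
  </analyzer>
</fieldType>
{noformat}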


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902593#action_12902593 ] 

Robert Muir commented on SOLR-2059:
-----------------------------------

Hi Peter:

that's a great example. It wasn't my actual use case either; I was just trying to give a good general one.

What do you think of the file format? Is it OK for describing these categories?
The format/parser is borrowed from MappingCharFilterFactory; it seemed unambiguous and is already in use.

As far as applying the patch goes, you need to apply it to https://svn.apache.org/repos/asf/lucene/dev/trunk, not https://svn.apache.org/repos/asf/lucene/dev/trunk/solr.

This is because it also has to modify a file under modules/.
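Concretely, the workflow would look something like this (a sketch, assuming the patch was generated with svn diff from the trunk root):
{noformat}
svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk lucene-trunk
cd lucene-trunk
patch -p0 < SOLR-2059.patch
{noformat}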




[jira] Commented: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902891#action_12902891 ] 

Robert Muir commented on SOLR-2059:
-----------------------------------

Thanks for the feedback. I'd like to commit (to trunk and 3x) in a few days if no one objects.





[jira] Resolved: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved SOLR-2059.
-------------------------------

    Resolution: Fixed

Committed revision 990451 (trunk) and revision 990456 (3x).




[jira] Updated: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2059:
------------------------------

    Attachment: SOLR-2059.patch




[jira] Commented: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.

Posted by "Peter Karich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902588#action_12902588 ] 

Peter Karich commented on SOLR-2059:
------------------------------------

Robert,

thanks for this work! I have a different application for this patch: in a Twitter search, # and @ shouldn't be removed. Instead I will handle them as ALPHA, I think.

Would you mind updating the patch for the latest version of trunk? I get a problem in WordDelimiterIterator at line 254 if I use https://svn.apache.org/repos/asf/lucene/dev/trunk/solr, and a missing-file problem (line 37) with http://svn.apache.org/repos/asf/solr
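To illustrate the idea, here is a toy Python model of the per-character type table (this is only a sketch of the concept, not the actual Lucene/Solr implementation; the type constants and split logic are simplified):

```python
import unicodedata

# Simplified character type flags, loosely mirroring WordDelimiterFilter's.
LOWER, UPPER, DIGIT, SUBWORD_DELIM = 1, 2, 4, 8
ALPHA = LOWER | UPPER

def default_type(ch):
    """Compute a character's type from its Unicode general category."""
    cat = unicodedata.category(ch)
    if cat == 'Ll':
        return LOWER
    if cat in ('Lu', 'Lt'):
        return UPPER
    if cat.startswith('N'):
        return DIGIT
    if cat.startswith('L'):
        return ALPHA
    return SUBWORD_DELIM

def split(text, custom_types=None):
    """Split text on delimiter characters; custom overrides win over defaults."""
    custom_types = custom_types or {}
    parts, current = [], []
    for ch in text:
        t = custom_types.get(ch, default_type(ch))
        if t == SUBWORD_DELIM:
            if current:
                parts.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append(''.join(current))
    return parts

# By default '#' and '@' are delimiters; mapped to ALPHA they are kept.
print(split("#solr @user"))                            # ['solr', 'user']
print(split("#solr @user", {'#': ALPHA, '@': ALPHA}))  # ['#solr', '@user']
```

The same override mechanism covers the financial-data example from the issue description: mapping `$` and `,` to DIGIT keeps `$3,000` as one token instead of splitting it.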




[jira] Issue Comment Edited: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.

Posted by "Peter Karich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902600#action_12902600 ] 

Peter Karich edited comment on SOLR-2059 at 8/25/10 3:46 PM:
-------------------------------------------------------------

Oops, my mistake ... that helped!

> What do you think of the file format, is it ok for describing these categories? 

I think it is OK. I had an even simpler patch (handleAsChar="@#") before stumbling over yours, which is more powerful IMHO:
{code} 
@ => ALPHA
# => ALPHA
{code} 




