Posted to solr-dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2006/12/20 23:42:20 UTC

[jira] Created: (SOLR-89) new TokenFilters for whitespace trimming and pattern replacing

new TokenFilters for whitespace trimming and pattern replacing
--------------------------------------------------------------

                 Key: SOLR-89
                 URL: http://issues.apache.org/jira/browse/SOLR-89
             Project: Solr
          Issue Type: New Feature
            Reporter: Hoss Man
         Assigned To: Hoss Man


(note: lumping these in a single issue since I did them both at the same time)

More than one person has asked me recently about how they can configure strings which:
   a) sort case insensitively
   b) ignore leading (and trailing, although it's not as big of an issue) whitespace
   c) ignore certain characters anywhere in the string (e.g., strip punctuation)

The first can be solved already using the KeywordTokenizer in conjunction with the LowerCaseFilter.  I've written a TrimFilter and PatternReplaceFilter to address the latter two.  (Strictly speaking, TrimFilter isn't needed since you can make a pattern that matches leading or trailing whitespace, but for people who are only interested in the whitespace issue, I'm sure String.trim() is more efficient than a regex)
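
To make the parenthetical above concrete, here is a minimal sketch in plain Java (not the actual TrimFilter code from the patch) comparing a regex that matches leading or trailing whitespace with String.trim(). The class name, method names, and the pattern itself are assumptions made up for illustration.

```java
import java.util.regex.Pattern;

public class TrimVersusRegex {
    // Hypothetical pattern: a run of whitespace at the start or at the end of the input.
    private static final Pattern EDGE_WHITESPACE = Pattern.compile("^\\s+|\\s+$");

    // Regex approach: delete every match of the edge-whitespace pattern.
    static String trimWithRegex(String s) {
        return EDGE_WHITESPACE.matcher(s).replaceAll("");
    }

    // Plain approach: String.trim() strips leading and trailing characters <= U+0020.
    static String trimWithString(String s) {
        return s.trim();
    }

    public static void main(String[] args) {
        String raw = "  \tHello World \n";
        System.out.println("[" + trimWithRegex(raw) + "]");   // prints [Hello World]
        System.out.println("[" + trimWithString(raw) + "]");  // prints [Hello World]
    }
}
```

Both approaches leave interior whitespace alone. One difference worth keeping in mind: String.trim() treats every character up to U+0020 as whitespace, while \s in a Java regex covers a narrower set by default, so the two can disagree on unusual control characters.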

An example of how they can be used...

    <!-- This is an example of using the KeywordTokenizer along
         with various TokenFilterFactories to produce a sortable field
         that does not include some properties of the source text
      -->
    <fieldtype name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
             input string is preserved as a single token
          -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- The LowerCase TokenFilter does what you expect, which can be
             useful when you want your sorting to be case insensitive
          -->
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- The TrimFilter removes any leading or trailing whitespace -->
        <filter class="solr.TrimFilterFactory" />
        <!-- The PatternReplaceFilter gives you the flexibility to use
             Java regular expressions to replace any sequence of characters
             matching a pattern with an arbitrary replacement string,
             which may include back references to portions of the original
             string matched by the pattern.

             See the Java Regular Expression documentation for more
             information on pattern and replacement string syntax.
             
             http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
          -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        />
      </analyzer>
    </fieldtype>
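
As a rough illustration of what the analyzer chain above does to a value, the following sketch emulates the three filters with plain java.lang.String calls. It is not the Solr code from the patch: the class and method names are hypothetical, and toLowerCase(Locale.ROOT) only approximates the LowerCaseFilter.

```java
import java.util.Locale;

public class AlphaOnlySortKey {
    // Emulates the alphaOnlySort chain on one input string (an approximation only).
    static String sortKey(String input) {
        String token = input;                      // KeywordTokenizer: the whole input is one token
        token = token.toLowerCase(Locale.ROOT);    // LowerCaseFilter (approximated)
        token = token.trim();                      // TrimFilter
        token = token.replaceAll("([^a-z])", "");  // PatternReplaceFilter with replace="all"
        return token;
    }

    public static void main(String[] args) {
        System.out.println(sortKey("  Hello, World! "));  // prints helloworld
        System.out.println(sortKey("O'Reilly 2nd Ed."));  // prints oreillynded
    }
}
```

Note that the pattern ([^a-z]) removes everything outside a-z, including digits, interior whitespace, and non-ASCII letters, so with this particular pattern the trim step has no visible effect on the result; it is in the chain to show how the filters combine.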


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Resolved: (SOLR-89) new TokenFilters for whitespace trimming and pattern replacing

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man resolved SOLR-89.
--------------------------

    Resolution: Fixed

Patch committed with a few small javadoc tweaks and a bit of whitespace added to one of the example docs to illustrate PatternReplaceFilter's effects.

[jira] Updated: (SOLR-89) new TokenFilters for whitespace trimming and pattern replacing

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/SOLR-89?page=all ]

Hoss Man updated SOLR-89:
-------------------------

    Attachment: pattern-and-trim-filters.patch

Patch containing both new Filters, Factories, and test cases.

Feedback would be appreciated, but I'm not in a big rush to commit.
