Posted to solr-user@lucene.apache.org by Adam Estrada <es...@gmail.com> on 2011/06/17 21:26:04 UTC

REGEX Proper Usage?

All,

I am having trouble getting my regex pattern to work properly. I have tried
PatternReplaceFilterFactory after the standard tokenizer:

<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9])"
replacement=" " replace="all"/>

and PatternReplaceCharFilterFactory before it:

<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([^a-zA-Z0-9])" replacement=" " replace="all"/>

It looks like this should remove everything except letters and numbers.
Here is the full analyzer chain, with the token filter last:

        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>

        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_en.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LengthFilterFactory" min="2" max="999"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z0-9])" replacement=" " replace="all"/>

I am still left with quite a few facet items like this:

<int name="_ view">1443</int>
<int name="view _">1599</int>

Can anyone suggest what may be going on here? I have verified that my regex
works properly here: http://www.fileformat.info/tool/regex.htm

Adam

Re: REGEX Proper Usage?

Posted by Dave Searle <da...@magicalia.com>.
Maybe try ([^a-z0-9]+)
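
For reference, here is roughly how that suggestion would look when applied to the token filter from the original post. This is only a sketch: the first variant changes nothing but the pattern, and the second (an empty replacement) is an assumption that is not from this thread. Use one or the other, not both.

```xml
<!-- Variant 1: the filter from the original post with "+" added, so a run of
     non-alphanumeric characters is replaced once instead of once per character. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="([^a-z0-9]+)" replacement=" " replace="all"/>

<!-- Variant 2 (assumption, not from the thread): an empty replacement removes
     the matched characters instead of leaving a space inside the token. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="([^a-z0-9]+)" replacement="" replace="all"/>
```

Keep in mind that a token filter rewrites the text of each token and does not split it again, so with replacement=" " any space it inserts stays inside the indexed term. Whether that fully accounts for the facet values shown above is not certain from what was posted.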
