Posted to solr-user@lucene.apache.org by Adam Estrada <es...@gmail.com> on 2011/06/17 21:26:04 UTC
REGEX Proper Usage?
All,
I am having trouble getting my regex pattern to work properly. I have tried
PatternReplaceFilterFactory after the standard tokenizer
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9])"
replacement=" " replace="all"/>
and PatternReplaceCharFilterFactory before it.
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([^a-zA-Z0-9])" replacement=" " replace="all"/>
It looks like this should work to remove everything except letters and
numbers.
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords_en.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LengthFilterFactory" min="2" max="999"/>
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z0-9])" replacement=" " replace="all"/>
I am left with quite a few facet items like this
<int name="_ view">1443</int>
<int name="view _">1599</int>
Can anyone suggest what may be going on here? I have verified that my regex
works properly here http://www.fileformat.info/tool/regex.htm
Adam
Re: REGEX Proper Usage?
Posted by Dave Searle <da...@magicalia.com>.
Maybe try ([^a-z0-9]+)
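One thing worth keeping in mind alongside that pattern: PatternReplaceFilterFactory runs on the text of each token after tokenization, so a replacement of " " puts a space inside the token rather than splitting it into two tokens. Below is a rough sketch of how the chain might look if the goal is to strip non-alphanumeric characters from each token. The field type name "text_alnum" is made up for illustration, the empty replacement and the position of the length filter are assumptions about the intent, and the stop filter is left out for brevity.

```xml
<!-- Sketch only: "text_alnum" is a hypothetical field type name. -->
<fieldType name="text_alnum" class="solr.TextField">
  <analyzer>
    <!-- Char filters see the raw text before the tokenizer. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Applied to each token's text; an empty replacement deletes
         the matched characters instead of inserting spaces. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z0-9]+)" replacement="" replace="all"/>
    <!-- Placed last (an assumption) so that tokens shortened or
         emptied by the replacement are dropped. -->
    <filter class="solr.LengthFilterFactory" min="2" max="999"/>
  </analyzer>
</fieldType>
```

The analysis page in the Solr admin UI shows the output of each stage for a sample value, which should make it easier to see which step produces the unexpected terms.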
Sent by CarrierPigeon