Posted to solr-user@lucene.apache.org by Michael Tobias <mt...@btinternet.com> on 2017/05/11 01:08:50 UTC

newbie question re solr.PatternReplaceFilterFactory

I am sure this is very simple but I cannot get the pattern right.

How can I use solr.PatternReplaceFilterFactory to remove all words in brackets from being indexed?

eg [ignore this]

thanks

Michael


Re: newbie question re solr.PatternReplaceFilterFactory

Posted by Erick Erickson <er...@gmail.com>.
First, use PatternReplaceCharFilterFactory instead. The difference is that
PatternReplaceCharFilterFactory works on the entire input, whereas
PatternReplaceFilterFactory works only on the tokens emitted by the
tokenizer. A concrete example using WhitespaceTokenizerFactory would be

this [is some ] text

PatternReplaceFilterFactory would see 5 tokens: "this", "[is", "some",
"]", and "text". No single token contains both brackets, so it would be
very hard to do what you want.

PatternReplaceCharFilterFactory will see the entire input as one
string and operate on it, _then_ send it through the tokenizer.
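
To make the difference concrete, here is a rough Python sketch that only
mimics the two stages; it does not call Solr. Splitting on whitespace stands
in for the whitespace tokenizer, and the function names are made up for
illustration.

```python
import re

# A bracketed span, matched non-greedily.
BRACKETED = re.compile(r"\[.*?\]")

def char_filter_then_tokenize(text):
    """Mimic a char filter: rewrite the whole input, then tokenize."""
    cleaned = BRACKETED.sub(" ", text)
    return cleaned.split()

def tokenize_then_token_filter(text):
    """Mimic a token filter: tokenize first, then rewrite each token alone."""
    tokens = text.split()
    return [BRACKETED.sub(" ", tok) for tok in tokens]

sample = "this [is some ] text"
print(char_filter_then_tokenize(sample))   # bracketed words are gone
print(tokenize_then_token_filter(sample))  # no token matches, nothing removed
```

In the second function the pattern never matches, because no individual
token holds both an opening and a closing bracket.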

And also don't be fooled by the fact that the _stored_ data will still
contain the removed words. So when you get the doc back from Solr
you'll see the original input, brackets and all. In the above example,
if you returned the field you'd still see

this [is some ] text

when the doc matched. This doc would be found when searching for
"this" or "text", but _not_ when searching for "is" or "some".

You want some pattern like this, placed inside the field type's
<analyzer> before the <tokenizer>:

      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="\[.*?\]" replacement=" "/>
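
If you want to check how the pattern behaves before reindexing, a small
Python sketch like the one below may help. It is only an approximation: Solr
uses Java regular expressions, though for these simple patterns Python's re
module behaves the same way. The greedy and negated-class variants and the
helper name are my own additions for comparison, not something you need.

```python
import re

NON_GREEDY = r"\[.*?\]"   # the pattern from the charFilter above
GREEDY = r"\[.*\]"        # for comparison only: runs to the LAST bracket
NEGATED = r"\[[^\]]*\]"   # alternative that also spans line breaks

def remove_bracketed(pattern, text):
    """Replace every match with a space and return the remaining words."""
    return re.sub(pattern, " ", text).split()

two_groups = "keep [drop one] middle [drop two] end"
print(remove_bracketed(NON_GREEDY, two_groups))  # "middle" survives
print(remove_bracketed(GREEDY, two_groups))      # "middle" is swallowed

multiline = "keep [drop\nthis] end"
print(remove_bracketed(NON_GREEDY, multiline))   # no match across the newline
print(remove_bracketed(NEGATED, multiline))      # bracketed words removed
```

The non-greedy form matters when a field has more than one bracketed
section. If the bracketed text can contain line breaks, note that "." does
not match a newline by default, so the negated-class form is the safer
choice in that case.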

Best,
Erick
