Posted to solr-user@lucene.apache.org by Greg Preston <gp...@marinsoftware.com> on 2013/09/25 23:43:19 UTC
How to always tokenize on underscore?
[Using SolrCloud 4.4.0]
I have a text field where the data will sometimes be delimited by
whitespace, and sometimes by underscore. For example, both of the
following are possible input values:
Group_EN_1000232142_blah_1000232142abc_foo
Group EN 1000232142 blah 1000232142abc foo
What I'd like to do is have underscores treated as spaces for
tokenization purposes. I've tried using a PatternReplaceFilterFactory
with:
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="_"
            replacement=" " replace="all" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="_"
            replacement=" " replace="all" />
  </analyzer>
</fieldType>
but that seems to do the pattern replacement on each token, rather
than splitting tokens into multiple tokens based on the pattern. So
with the input "Group_EN_1000232142_blah_1000232142abc_foo" I end up
with a single token of "group en 1000232142 blah 1000232142abc foo"
rather than what I want, which is 6 tokens: "group", "en",
"1000232142", "blah", "1000232142abc", "foo".
Is there a way to configure for the behavior I'm looking for, or would
I need to write a custom tokenizer?
Thanks!
-Greg
Re: How to always tokenize on underscore?
Posted by Greg Preston <gp...@marinsoftware.com>.
This is exactly what I needed. Thank you!
-Greg
On Wed, Sep 25, 2013 at 2:48 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> Use the char filter instead:
> http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html
>
> -- Jack Krupansky
Re: How to always tokenize on underscore?
Posted by Jack Krupansky <ja...@basetechnology.com>.
Use the char filter instead:
http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html
-- Jack Krupansky
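[Editor's sketch, not part of the original reply] The suggestion above could look roughly like the following when applied to the field type from the question. This is a minimal, untested sketch: it assumes the same "text_general" field type and only swaps the token filter for a char filter, using the pattern and replacement attributes of PatternReplaceCharFilterFactory. Char filters rewrite the raw character stream before the tokenizer runs, so the tokenizer should see spaces where the underscores were and split there, whereas a token filter only sees tokens the tokenizer has already produced.

```xml
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- Runs before tokenization: turn every underscore into a space -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same char filter at query time so queries are split the same way -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Existing documents would need to be reindexed after a change like this, and the Analysis screen in the Solr admin UI is a convenient way to check that "Group_EN_1000232142_blah_1000232142abc_foo" now yields separate tokens.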