Posted to solr-user@lucene.apache.org by Robert Hume <rh...@gmail.com> on 2017/05/24 20:19:48 UTC
[Simplified my question] How to enhance solr.StandardTokenizerFactory?
(was: Why is Standard Tokenizer not separating at this comma?)
Hi,
Following up on my last email question ... I've learned more and I've
simplified my question ...
I have a Solr 3.6 deployment. Currently I'm using
solr.StandardTokenizerFactory to parse tokens during indexing.
Here are two example streams that demonstrate my issue:
Example 1: `bob,a-z,000123,xyz` produces tokens ... `|bob|a-z|000123|xyz|`
... which is good.
Example 2: `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|`
... which is not good because users can't search by "000123".
It seems StandardTokenizerFactory treats the "6,000" differently (like it's
currency or a product number, maybe?) so it doesn't tokenize at the comma.
QUESTION: How can I enhance StandardTokenizer to do everything it's doing
now plus produce a couple of additional tokens like this ...
`bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|a-6|000123|`
... so users can search by "000123"?
Thanks!
Rob
Re: [Simplified my question] How to enhance solr.StandardTokenizerFactory?
(was: Why is Standard Tokenizer not separating at this comma?)
Posted by Steve Rowe <sa...@gmail.com>.
Hi Robert,
Two possibilities come to mind:
1. Use a char filter factory (runs before the tokenizer) to convert commas between digits to spaces, e.g. PatternReplaceCharFilterFactory <https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory>.
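   A minimal sketch of option 1 in a Solr 3.6 schema.xml; the field type name "text_parts" is illustrative, not something from your schema:

   ```xml
   <fieldType name="text_parts" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <!-- Replace any comma sandwiched between two digits with a space,
            so StandardTokenizer no longer keeps "6,000123" as one token. -->
       <charFilter class="solr.PatternReplaceCharFilterFactory"
                   pattern="(\d),(\d)" replacement="$1 $2"/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>
   ```

   With this in place, `bob,a-6,000123,xyz` should reach the tokenizer as `bob,a-6 000123,xyz`, so `000123` comes out as its own token. Note this changes the text before tokenization, so the `a-6,000123` token from your example would no longer be produced.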
2. Use WordDelimiterFilterFactory <https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter> (runs after the tokenizer) to split the compound token into subwords while optionally preserving the original.
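   Option 2 keeps StandardTokenizer untouched and splits the surviving compound token afterwards; a sketch, assuming the same illustrative field type:

   ```xml
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <!-- Split "a-6,000123" on the hyphen and comma, emit word and number
          parts as extra tokens, and keep the original token so existing
          searches still match. -->
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             preserveOriginal="1"/>
   </analyzer>
   ```

   One nuance: this should emit `a`, `6`, and `000123` as separate parts (plus the original `a-6,000123`), rather than the single `a-6` token in your desired output; the catenate options (catenateWords, catenateNumbers, catenateAll) can glue adjacent parts back together if that matters.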
--
Steve
www.lucidworks.com