Posted to solr-user@lucene.apache.org by Robert Hume <rh...@gmail.com> on 2017/05/24 20:19:48 UTC

[Simplified my question] How to enhance solr.StandardTokenizerFactory? (was: Why is Standard Tokenizer not separating at this comma?)

Hi,

Following up on my last email question ... I've learned more and I
simplified my question ...

I have a Solr 3.6 deployment.  Currently I'm using
solr.StandardTokenizerFactory to parse tokens during indexing.

Here are two example streams that demonstrate my issue:

Example 1: `bob,a-z,000123,xyz` produces tokens ... `|bob|a-z|000123|xyz|`
... which is good.

Example 2: `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|`
... which is not good because users can't search by "000123".

It seems StandardTokenizerFactory treats the "6,000" differently (like it's
currency or a product number, maybe?) so it doesn't tokenize at the comma.

QUESTION: How can I enhance StandardTokenizer to do everything it's doing
now plus produce a couple of additional tokens like this ...

`bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|a-6|000123|`

... so users can search by "000123"?

Thanks!
Rob

Re: [Simplified my question] How to enhance solr.StandardTokenizerFactory? (was: Why is Standard Tokenizer not separating at this comma?)

Posted by Steve Rowe <sa...@gmail.com>.
Hi Robert,

Two possibilities come to mind:

1. Use a char filter factory (runs before the tokenizer) to convert commas between digits to spaces, e.g. PatternReplaceCharFilterFactory <https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory>.
2. Use WordDelimiterFilterFactory <https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter> (runs after the tokenizer) to split the tokens it produces on punctuation such as the comma and hyphen, optionally keeping the original token as well. Rough sketches of both options follow below.
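
Rough, untested schema.xml sketches of both options (the field type names are made up, and the attribute values are starting points to tune against your data, not drop-in settings):

Option 1, replace a comma that sits between two digits with a space before tokenizing:

  <fieldType name="text_num_split" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- "a-6,000123" should become "a-6 000123" before it reaches the tokenizer -->
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="(\d),(\d)" replacement="$1 $2"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Option 2, let the tokenizer keep "a-6,000123" together and split it afterwards, preserving the original token too:

  <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- should emit "a", "6", "000123" plus the original "a-6,000123" -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Either way, the Analysis page in the Solr admin UI is the quickest way to confirm which tokens each chain actually produces for your sample strings.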

--
Steve
www.lucidworks.com

> On May 24, 2017, at 4:19 PM, Robert Hume <rh...@gmail.com> wrote:
> [quoted message trimmed]