
Posted to solr-user@lucene.apache.org by Robert Hume <rh...@gmail.com> on 2017/05/24 19:05:05 UTC

Why is Standard Tokenizer not separating at this comma?

I have a Solr 3.6 deployment I inherited.

The schema.xml specifies the use of StandardTokenizerFactory like so ...

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    ...
      <tokenizer class="solr.StandardTokenizerFactory"/>
    ...


According to this reference guide (
https://home.apache.org/~ctargett/RefGuidePOC/jekyll/Tokenizers.html) ...
the StandardTokenizer will treat punctuation as delimiters.


However, here is my content that gets indexed:

    "IOM-1:BA9ATS0FAB,\"Company Name

Module\",8.1.0.16.0.2,B-A,000006KB09029932,PASS,,0,0,0,Y:0,0,0,0,0:BA9AUT0FAB,\"Company
CM Rear Module\",B-6,000009XP12133407,"



This piece `B-A,000006KB09029932` gets tokenized into two words ... `|B-A|`
and `|000006KB09029932|`.


But this piece `B-6,000009XP12133407` gets tokenized into one word ...
`|B-6,000009XP12133407|`.

What I've observed is that the comma is not treated as a delimiter when it is
preceded by a digit ... almost as if it considers "6,000" to be currency or
something?


QUESTION: Is this a bug in StandardTokenizer, or do I misunderstand how
commas are used as delimiters?

Rob

Re: Why is Standard Tokenizer not separating at this comma?

Posted by Steve Rowe <sa...@gmail.com>.

Hi Robert,

The StandardTokenizer implements the word boundary rules from UAX#29 <http://unicode.org/reports/tr29/#Word_Boundaries>, discarding anything between boundaries that is exclusively non-alphanumeric (e.g. punctuation).
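
To illustrate why the comma in "6,000009" does not split, here is a rough Python sketch of the relevant UAX#29 idea: a comma (class MidNum) sitting directly between two digits does not create a word boundary (rules WB11/WB12), while a comma between a letter and a digit does. This is not Lucene's implementation. It is ASCII-only, covers only a handful of the rules, and the names `tokenize` and `joins` are made up for this example. It also splits at hyphens, unlike the tokens reported in the question, so treat it only as an illustration of the comma rule.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
MID_NUM = set(",;")      # assumed subset of UAX#29 MidNum
MID_NUM_LET = set(".")   # assumed subset of UAX#29 MidNumLet


def joins(prev, mid, nxt):
    """Return True if `mid` does NOT break the word between prev and nxt."""
    if prev in DIGITS and nxt in DIGITS:
        # WB11/WB12: digit (MidNum | MidNumLet) digit stays together
        return mid in MID_NUM or mid in MID_NUM_LET
    if prev in LETTERS and nxt in LETTERS:
        # WB6/WB7 (simplified): letter MidNumLet letter stays together
        return mid in MID_NUM_LET
    return False


def tokenize(text):
    """Split text into tokens, dropping runs that contain no alphanumerics."""
    tokens, current = [], ""
    for i, ch in enumerate(text):
        if ch in LETTERS or ch in DIGITS:
            # Letters and digits always extend the current token (WB5, WB8-WB10)
            current += ch
        elif current and i + 1 < len(text) and joins(current[-1], ch, text[i + 1]):
            current += ch
        else:
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


print(tokenize("B-A,000006KB09029932"))  # comma between 'A' and '0' splits
print(tokenize("B-6,000009XP12133407"))  # comma between '6' and '0' does not
```

In this sketch the first call yields `['B', 'A', '000006KB09029932']` and the second yields `['B', '6,000009XP12133407']`, which mirrors the comma behavior described in the question.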

--
Steve
www.lucidworks.com
