Posted to users@solr.apache.org by Tom Van Cuyck <to...@ontoforce.com> on 2021/06/11 11:49:47 UTC

ClassicTokenizer not working as expected

Hi,

I have an issue with the ClassicTokenizer. According to the documentation
(https://solr.apache.org/guide/8_8/tokenizers.html#classic-tokenizer), this
should work as follows:

- Words are split at hyphens, unless there is a number in the word, in
which case the token is not split and the numbers and hyphen(s) are
preserved.

If I run the analysis on 'abc-123', it correctly returns a single token.
However, if I enter 'abc-def-123', it returns two tokens, 'abc' and
'def-123', which I did not expect.

Is there a tokenizer or setting that can keep this as a single token?

As a secondary minor question: the first token 'abc' is of type <ALPHANUM>
while the second token 'def-123' is of type <NUM>. Why is the second token
not of the type <ALPHANUM>? I looked for more information on these types
but could not find any.

Kind regards, Tom

Re: ClassicTokenizer not working as expected

Posted by Shawn Heisey <ap...@elyograg.org>.

On 2021-06-11 05:49, Tom Van Cuyck wrote:
> I have an issue with the ClassicTokenizer. According to the
> documentation
> (https://solr.apache.org/guide/8_8/tokenizers.html#classic-tokenizer)
> this should work as follows:
> 
> - Words are split at hyphens, unless there is a number in the word, in
> which case the token is not split and the numbers and hyphen(s) are
> preserved.
> 
> If I run the analysis on 'abc-123' it properly returns a single token.
> However if I enter 'abc-def-123' it returns 2 tokens: 'abc' and
> 'def-123' which is unexpected to me.
> 
> Is there a tokenizer or setting that can keep this as a single token?

Since ClassicTokenizer is a Lucene class, its javadoc is here:

https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html

It says the same thing you found in the Solr docs. I suspect the
tokenizer is doing exactly as advertised. It sees the first part and
emits the token "abc", then continues on. When it works on the next
part, it sees the number after the delimiter, and the documented
behavior for numbers kicks in.

Getting the tokenizer to look ahead through multiple delimiters to do 
what you're expecting would probably be a lot harder than it sounds.  
I'm not an expert in analyzer code, though.

I do not have any idea about the token type.  That does sound a little 
bit wrong, but I can't speak for the code author's intent.
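
To make the left-to-right behavior described above concrete, here is a
rough Python model of it. It is not Lucene's code. It assumes that a
hyphenated run is kept together only when every other segment contains a
digit, and that such a run is labelled <NUM>. Both are my reading of the
tokenizer grammar, not something confirmed in this thread. The function
name classic_like_tokens and its helpers are made up for illustration,
and the other ClassicTokenizer rules (email addresses, host names,
acronyms, apostrophes) are ignored.

```python
import re

# One run of ASCII letters/digits (a "segment" between punctuation marks).
SEGMENT = re.compile(r"[A-Za-z0-9]+")
# Punctuation assumed to be allowed inside a number-like token.
PUNCT = "_-/.,"


def has_digit(segment):
    return any(ch.isdigit() for ch in segment)


def classic_like_tokens(text):
    """Simplified, hypothetical model of the number rule; not Lucene's code."""
    tokens = []
    pos = 0
    while pos < len(text):
        first = SEGMENT.match(text, pos)
        if first is None:
            pos += 1  # skip delimiters and anything else we do not model
            continue
        # Collect the chain of segments joined by single punctuation marks.
        segments = [first.group()]
        ends = [first.end()]
        end = first.end()
        while end < len(text) and text[end] in PUNCT:
            nxt = SEGMENT.match(text, end + 1)
            if nxt is None:
                break
            segments.append(nxt.group())
            ends.append(nxt.end())
            end = nxt.end()
        # Longest prefix (two or more segments) in which every other
        # segment has a digit; otherwise fall back to the first segment.
        best = 1
        for n in range(2, len(segments) + 1):
            prefix = segments[:n]
            if all(map(has_digit, prefix[0::2])) or all(map(has_digit, prefix[1::2])):
                best = n
        token_end = ends[best - 1]
        token_type = "<NUM>" if best > 1 else "<ALPHANUM>"
        tokens.append((text[pos:token_end], token_type))
        pos = token_end
    return tokens


print(classic_like_tokens("abc-123"))
print(classic_like_tokens("abc-def-123"))
```

Under this model 'abc-123' stays whole as a <NUM> token, while
'abc-def-123' yields 'abc' (<ALPHANUM>) followed by 'def-123' (<NUM>),
because neither 'abc-def' nor 'abc-def-123' has a digit in every other
segment. That matches the output reported in the original question.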

Thanks,
Shawn