Posted to solr-user@lucene.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2014/11/02 22:34:28 UTC

Question about StandardTokenizer in Solr 4.9

Hi all:

From the description of the StandardTokenizer, it should recognize Internet domain names and email addresses and preserve them as a single token, which works great, but I've detected that in cases like this:

socks25.domain.com it outputs 2 tokens: socks25 | domain.com

If the URL doesn't have any numbers:

socks.domain.com it outputs a single token: socks.domain.com

The same happens if the number is not at the end of a URL part:

so2cks.domain.com it outputs a single token: so2cks.domain.com

Is this intended behavior? The odd part is that without the number at the end of a URL part it works fine.

Regards,

Re: Question about StandardTokenizer in Solr 4.9

Posted by Jack Krupansky <ja...@basetechnology.com>.

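Yeah, that behavior is consistent with what I documented in my e-book for
Solr. The dot is kept only if it falls between two digits or between two
letters. In socks25.domain.com the first dot sits between a digit and a
letter, so the token is split there.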
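
To make the rule concrete, here is a minimal Python sketch of it. This is only a simplified model of the dot handling described above, not Lucene's actual implementation: the function name split_on_dots is hypothetical, and it ignores whitespace and every other kind of punctuation.

```python
def split_on_dots(text):
    """Simplified model (an assumption, not Lucene code): keep a '.'
    only when it sits between two letters or between two digits;
    otherwise treat it as a token boundary."""
    tokens = []
    current = []
    for i, ch in enumerate(text):
        if ch != ".":
            current.append(ch)
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        both_letters = prev.isalpha() and nxt.isalpha()
        both_digits = prev.isdigit() and nxt.isdigit()
        if both_letters or both_digits:
            # The dot joins its neighbours into one token.
            current.append(ch)
        else:
            # Mixed neighbours (digit/letter) or an edge: split here.
            if current:
                tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


for host in ("socks25.domain.com", "socks.domain.com", "so2cks.domain.com"):
    print(host, "->", " | ".join(split_on_dots(host)))
```

Under this model the three hostnames from the question come out as socks25 | domain.com, socks.domain.com and so2cks.domain.com, which matches what was reported.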

-- Jack Krupansky
