Posted to solr-user@lucene.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2014/11/02 22:34:28 UTC
Question about StandardTokenizer in Solr 4.9
Hi all:
From the description of the StandardTokenizer, it should recognize Internet domain names and email addresses and preserve each as a single token. That works great in general, but I've noticed that in cases like this:
socks25.domain.com it outputs 2 tokens: socks25 | domain.com
if the URL doesn't have any numbers:
socks.domain.com it outputs a single token: socks.domain.com
The same happens if the number is not at the end of a URL part:
so2cks.domain.com it outputs a single token: so2cks.domain.com
Is this intended behavior? The odd part is that it works fine as long as the number is not at the end of a URL part.
Regards,
Re: Question about StandardTokenizer in Solr 4.9
Posted by Jack Krupansky <ja...@basetechnology.com>.
Yeah, that behavior is consistent with what I documented in my e-book for
Solr. The dot is kept only if between two digits or two letters.
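To make the rule concrete, here is a rough Python sketch of it. This is only a simplified model of the dot handling described above, not Lucene's actual StandardTokenizer code, and the function name split_on_dots is made up for illustration.

```python
def split_on_dots(text):
    """Simplified model: a dot stays inside a token only when it sits
    between two letters or between two digits; otherwise it splits."""
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        if ch != ".":
            current += ch
            continue
        # Look at the characters on either side of the dot.
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        keep = (prev_ch.isalpha() and next_ch.isalpha()) or (
            prev_ch.isdigit() and next_ch.isdigit()
        )
        if keep:
            current += ch
        else:
            # Mixed neighbours (e.g. digit then letter): end the token here.
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


for sample in ("socks25.domain.com", "socks.domain.com", "so2cks.domain.com"):
    print(sample, "->", " | ".join(split_on_dots(sample)))
```

In socks25.domain.com the first dot has a digit on its left and a letter on its right, so the token is split there. In so2cks.domain.com the digit is not adjacent to the dot, so nothing breaks.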
-- Jack Krupansky