Posted to java-user@lucene.apache.org by Junte Zhang <Ju...@localsearch.ch> on 2018/04/26 08:15:23 UTC
UAX29URLEmailTokenizer is not detecting the correct URL token type
Hi all,
We are using the UAX29URLEmailTokenizer so we can use the token types in our plugins.
However, I noticed that the tokenizer assigns some URLs the token type <ALPHANUM> instead of <URL>.
Examples that do not work:
example.com is <ALPHANUM>
example.net is <ALPHANUM>
But:
https://example.com is <URL> as is https://example.net.
Examples that work:
example.ch is <URL>
example.co.uk is <URL>
example.nl is <URL>
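For anyone who wants to reproduce this, a minimal sketch along the following lines should print each token with its type. It assumes lucene-analyzers-common is on the classpath; the class name TokenTypeCheck and the helper tokenTypes are only illustrative and not part of Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Illustrative class name (an assumption, not part of Lucene).
public class TokenTypeCheck {

    // Hypothetical helper: returns one "term is <TYPE>" line per emitted token.
    static List<String> tokenTypes(String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer()) {
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.reset();                  // required before the first incrementToken()
            while (tokenizer.incrementToken()) {
                result.add(term.toString() + " is " + type.type());
            }
            tokenizer.end();                    // finish the stream before it is closed
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // The inputs listed in this mail.
        String[] inputs = {
            "example.com", "example.net",
            "https://example.com", "https://example.net",
            "example.ch", "example.co.uk", "example.nl"
        };
        for (String input : inputs) {
            for (String line : tokenTypes(input)) {
                System.out.println(line);
            }
        }
    }
}
```

Each input is tokenized on its own here, so surrounding text cannot influence the type that is assigned.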
I have checked JIRA and could not find an existing issue for this. I have tested this on Lucene (Solr) 6.4.1 and 7.3.
Could someone confirm my findings and advise what I could do to (help) resolve this issue?
/JZ
<https://example.com>