Posted to java-user@lucene.apache.org by Junte Zhang <Ju...@localsearch.ch> on 2018/04/26 08:15:23 UTC

UAX29URLEmailTokenizer is not detecting the correct URL token type

Hi all,


We are using the UAX29URLEmailTokenizer so we can use the token types in our plugins.


However, I noticed that the tokenizer assigns certain URLs the token type <ALPHANUM> instead of <URL>.


Examples that are not working:


example.com is <ALPHANUM>

example.net is <ALPHANUM>


But:

https://example.com is <URL>, as is https://example.net.


Examples that work:

example.ch is <URL>

example.co.uk is <URL>

example.nl is <URL>


I have checked JIRA and could not find an existing issue for this. I have tested this on Lucene (Solr) 6.4.1 and 7.3.
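
For anyone who wants to reproduce this, here is a minimal sketch of how the token types can be printed. It assumes the Lucene 6.x/7.x package layout (org.apache.lucene.analysis.standard in lucene-analyzers-common); the class name TokenTypeCheck and the helper tokenTypes are made up for illustration and are not part of Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical helper class, for illustration only.
public class TokenTypeCheck {

    /** Returns a {term, type} pair for every token emitted for the given text. */
    static List<String[]> tokenTypes(String text) throws IOException {
        List<String[]> result = new ArrayList<>();
        try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer()) {
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.reset();                      // required before the first incrementToken()
            while (tokenizer.incrementToken()) {
                result.add(new String[] {term.toString(), type.type()});
            }
            tokenizer.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // The inputs are the examples from this mail.
        String[] inputs = {
            "example.com", "example.net", "https://example.com",
            "example.ch", "example.co.uk", "example.nl"
        };
        for (String input : inputs) {
            for (String[] token : tokenTypes(input)) {
                System.out.println(input + " -> " + token[0] + " " + token[1]);
            }
        }
    }
}
```

Running main prints one line per emitted token, so the reported type for each input can be compared across Lucene versions.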


Could someone confirm my findings and advise what I could do to (help) resolve this issue?


/JZ
