Posted to dev@lucene.apache.org by "Benji Smith (JIRA)" <ji...@apache.org> on 2015/03/06 22:57:38 UTC
[jira] [Created] (LUCENE-6348) Incorrect results from UAX_URL_EMAIL tokenizer
Benji Smith created LUCENE-6348:
-----------------------------------
Summary: Incorrect results from UAX_URL_EMAIL tokenizer
Key: LUCENE-6348
URL: https://issues.apache.org/jira/browse/LUCENE-6348
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Environment: Elasticsearch 1.3.4 on Ubuntu 14.04.2
Reporter: Benji Smith
I'm using an analyzer based on the UAX_URL_EMAIL tokenizer, with a maximum token length of 64 characters. I expect the analyzer to discard any URL text beyond those 64 characters, but the actual output also contains ordinary terms taken from the tail end of the URL.
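The analyzer definition itself is not part of this report. As a rough sketch only, a custom analyzer of this kind could be declared with index settings like the ones built below. The analyzer name is the one used in the request that follows; the tokenizer name "url_email_64" is invented for illustration, and the real configuration may differ.

```python
import json

# Assumed index settings, NOT copied from the report: a uax_url_email tokenizer
# capped at 64 characters, wrapped in a custom analyzer.
# "url_email_64" is a hypothetical tokenizer name chosen for this sketch.
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "url_email_64": {
                    "type": "uax_url_email",
                    "max_token_length": 64,
                }
            },
            "analyzer": {
                "uax_url_email_analyzer": {
                    "type": "custom",
                    "tokenizer": "url_email_64",
                }
            },
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(settings, indent=2))
```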
For example, this request analyzes a sentence that contains a long URL:
{code}
curl -XGET 'http://localhost:9200/my_index/_analyze?analyzer=uax_url_email_analyzer' -d "hey, check out http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-death-is-optional for some light reading."
{code}
The results look like this:
{code}
{
  "tokens": [
    {
      "token": "hey",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "check",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "out",
      "start_offset": 11,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-d",
      "start_offset": 15,
      "end_offset": 79,
      "type": "<URL>",
      "position": 4
    },
    {
      "token": "eath",
      "start_offset": 79,
      "end_offset": 83,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "is",
      "start_offset": 84,
      "end_offset": 86,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "optional",
      "start_offset": 87,
      "end_offset": 95,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "for",
      "start_offset": 96,
      "end_offset": 99,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "some",
      "start_offset": 100,
      "end_offset": 104,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "light",
      "start_offset": 105,
      "end_offset": 110,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "reading",
      "start_offset": 111,
      "end_offset": 118,
      "type": "<ALPHANUM>",
      "position": 11
    }
  ]
}
{code}
The term extracted from the URL is correct, and it is correctly truncated at 64 characters. But as you can see, the analysis pipeline also creates three spurious terms [ "eath", "is", "optional" ] that come from the discarded portion of the URL.
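The offsets in the output can be checked with plain string slicing. The sketch below does not call Lucene or Elasticsearch. It only cuts the URL at 64 characters to show where the spurious terms come from. The regular-expression split at the end is an illustrative stand-in, not the tokenizer's actual word-break rules.

```python
import re

text = ("hey, check out http://edge.org/conversation/"
        "yuval_noah_harari-daniel_kahneman-death-is-optional "
        "for some light reading.")
MAX_TOKEN_LENGTH = 64  # limit stated in the report

# Locate the URL in the input (it ends at the next space).
url_start = text.index("http://")
url_end = text.index(" ", url_start)
url = text[url_start:url_end]

# Split the URL at the maximum token length.
kept = url[:MAX_TOKEN_LENGTH]      # what the <URL> token contains
leftover = url[MAX_TOKEN_LENGTH:]  # what the reporter expects to be discarded

print(url_start, url_start + len(kept))  # 15 79, matching the <URL> token offsets
print(kept)      # http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-d
print(leftover)  # eath-is-optional

# Splitting the leftover on non-word characters reproduces the spurious terms.
print(re.findall(r"\w+", leftover))  # ['eath', 'is', 'optional']
```

The expected behavior described in this report is that the leftover portion produces no tokens at all.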