Posted to dev@lucene.apache.org by "Benji Smith (JIRA)" <ji...@apache.org> on 2015/03/06 22:57:38 UTC

[jira] [Created] (LUCENE-6348) Incorrect results from UAX_URL_EMAIL tokenizer

Benji Smith created LUCENE-6348:
-----------------------------------

             Summary: Incorrect results from UAX_URL_EMAIL tokenizer
                 Key: LUCENE-6348
                 URL: https://issues.apache.org/jira/browse/LUCENE-6348
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
         Environment: Elasticsearch 1.3.4 on Ubuntu 14.04.2
            Reporter: Benji Smith


I'm using an analyzer based on the UAX_URL_EMAIL tokenizer, with a maximum token length of 64 characters. I expect the analyzer to discard any text in the URL beyond those 64 characters, but the actual results also yield ordinary terms from the tail end of the URL.

For example:

{code}
curl -XGET http://localhost:9200/my_index/_analyze?analyzer=uax_url_email_analyzer -d "hey, check out http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-death-is-optional for some light reading."
{code}
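
For reference, an analyzer exhibiting this behavior can be defined with index settings along these lines (a minimal sketch; the index and analyzer names here are illustrative, and the actual settings in my cluster may differ slightly):

{code}
curl -XPUT http://localhost:9200/my_index -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "uax_url_email_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": 64
        }
      },
      "analyzer": {
        "uax_url_email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email_tokenizer"
        }
      }
    }
  }
}'
{code}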

The results look like this:

{code}
{
    "tokens": [
        {
            "token": "hey",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "check",
            "start_offset": 5,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "out",
            "start_offset": 11,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-d",
            "start_offset": 15,
            "end_offset": 79,
            "type": "<URL>",
            "position": 4
        },
        {
            "token": "eath",
            "start_offset": 79,
            "end_offset": 83,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "is",
            "start_offset": 84,
            "end_offset": 86,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "optional",
            "start_offset": 87,
            "end_offset": 95,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "for",
            "start_offset": 96,
            "end_offset": 99,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "some",
            "start_offset": 100,
            "end_offset": 104,
            "type": "<ALPHANUM>",
            "position": 9
        },
        {
            "token": "light",
            "start_offset": 105,
            "end_offset": 110,
            "type": "<ALPHANUM>",
            "position": 10
        },
        {
            "token": "reading",
            "start_offset": 111,
            "end_offset": 118,
            "type": "<ALPHANUM>",
            "position": 11
        }
    ]
}
{code}

The term from the extracted URL is correct, and correctly truncated at 64 characters. But as you can see, the analysis pipeline also creates three spurious terms, [ "eath", "is", "optional" ], which come from the discarded portion of the URL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
