Posted to dev@lucene.apache.org by "Atul (JIRA)" <ji...@apache.org> on 2018/02/20 14:20:00 UTC

[jira] [Created] (LUCENE-8181) WordDelimiterTokenFilter does not generate all tokens appropriately

Atul created LUCENE-8181:
----------------------------

             Summary: WordDelimiterTokenFilter does not generate all tokens appropriately
                 Key: LUCENE-8181
                 URL: https://issues.apache.org/jira/browse/LUCENE-8181
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 7.2.1
          Environment: *Steps to reproduce*:
*1. Create the index*

PUT testindex
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 2
    },
    "analysis": {
      "filter": {
        "wordDelimiter": {
          "type": "word_delimiter",
          "generate_word_parts": "true",
          "generate_number_parts": "true",
          "catenate_words": "false",
          "catenate_numbers": "false",
          "catenate_all": "false",
          "split_on_case_change": "true",
          "preserve_original": "true",
          "split_on_numerics": "true",
          "stem_english_possessive": "true"
        }
      },
      "analyzer": {
        "content_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "wordDelimiter",
            "lowercase"
          ]
        }
      }
    }
  }
}

*2. Analyze the text*

POST testindex/_analyze
{
  "analyzer": "content_analyzer",
  "text": "ElasticSearch.TestProject"
}
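
For anyone who wants to reproduce this against Lucene directly (the issue is filed against modules/analysis), here is a minimal Java sketch of the same chain. Two assumptions: the word_delimiter settings above are mapped onto the flag constants of Lucene's WordDelimiterGraphFilter (Elasticsearch may actually wire up the older WordDelimiterFilter, which uses the same flags), and the analyze helper name is mine, not part of any API.

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WordDelimiterRepro {

  // Same flags as the wordDelimiter settings above; the catenate_* options
  // are all false in the index settings, so they are left out here.
  static final int FLAGS =
      WordDelimiterGraphFilter.GENERATE_WORD_PARTS
          | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS
          | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE
          | WordDelimiterGraphFilter.SPLIT_ON_NUMERICS
          | WordDelimiterGraphFilter.STEM_ENGLISH_POSSESSIVE
          | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;

  // Mirrors content_analyzer: whitespace tokenizer -> asciifolding
  // -> word delimiter -> lowercase. Prints every emitted token.
  static void analyze(String text, int flags) throws Exception {
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(text));
    TokenStream ts = new ASCIIFoldingFilter(tokenizer);
    ts = new WordDelimiterGraphFilter(ts, flags, null);
    ts = new LowerCaseFilter(ts);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term);
    }
    ts.end();
    ts.close();
  }

  public static void main(String[] args) throws Exception {
    analyze("ElasticSearch.TestProject", FLAGS);
    // observed: elasticsearch.testproject, elastic, search, test, project
  }
}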

*The following tokens are generated:*

{
  "tokens": [
    {
      "token": "elasticsearch.testproject",
      "start_offset": 0,
      "end_offset": 25,
      "type": "word",
      "position": 0
    },
    {
      "token": "elastic",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "search",
      "start_offset": 7,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "test",
      "start_offset": 14,
      "end_offset": 18,
      "type": "word",
      "position": 2
    },
    {
      "token": "project",
      "start_offset": 18,
      "end_offset": 25,
      "type": "word",
      "position": 3
    }
  ]
}

*Expected Result:*
In addition to the tokens above, elasticsearch and testproject should also be generated, so that the phrase query "elasticsearch testproject" also matches.
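
For concreteness, this is the phrase query in question (a minimal sketch; the field name "content" is an assumption for illustration):

import org.apache.lucene.search.PhraseQuery;

// Hypothetical: the query that should match if "elasticsearch" and
// "testproject" were emitted at adjacent positions.
PhraseQuery query = new PhraseQuery("content", "elasticsearch", "testproject");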

*Another example:*
The text *"Super-Duper-0-AutoCoder"*, analyzed with the above analyzer, generates the token *autocoder*, while the text *"Super-Duper-AutoCoder"* does NOT generate the token *autocoder*.
            Reporter: Atul


When using the word delimiter token filter, some expected tokens are not generated.

When I analyze the text "ElasticSearch.TestProject", I expect the tokens elastic, search, test, project, elasticsearch, testproject, and elasticsearch.testproject to be generated, since split_on_case_change and split_on_numerics are enabled, preserve_original is true, and a whitespace tokenizer is used.

But I actually only see the following tokens:
elasticsearch.testproject, elastic, search, test, project


