Posted to dev@lucene.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2018/02/23 11:23:00 UTC

[jira] [Created] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

Rupert Westenthaler created LUCENE-8183:
-------------------------------------------

             Summary: HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
                 Key: LUCENE-8183
                 URL: https://issues.apache.org/jira/browse/LUCENE-8183
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 6.6
         Environment: Configuration of the analyzer:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
        dictionary="lang/wordlist_de.txt"
        onlyLongestMatch="true"/>

 
            Reporter: Rupert Westenthaler


The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled. 

Example:

Dictionary: {{gesellschaft}}, {{schaft}}
Hyphenator: {{de_DR.xml}} //from Apache Offo
onlyLongestMatch: true

 
HCWTF:

||text||raw_bytes||start||end||positionLength||type||position||
|gesellschaft|[67 65 73 65 6c 6c 73 63 68 61 66 74]|0|12|1|word|1|
|gesellschaft|[67 65 73 65 6c 6c 73 63 68 61 66 74]|0|12|1|word|1|
|schaft|[73 63 68 61 66 74]|0|12|1|word|1|

IMHO this output includes 2 unexpected tokens:
 # the 2nd 'gesellschaft', as it duplicates the original token
 # the 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary
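
The expectation behind these two points can be sketched in plain Java. This is only an illustration of one possible reading of "only longest match", not the Lucene implementation: the class {{LongestMatchSketch}} and the method {{expectedSubwords}} are hypothetical names, and the sketch assumes every character offset is a possible break point instead of consulting a hyphenator. It drops any dictionary match that merely repeats the whole input word and any match that lies inside a longer match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Lucene code: models the behaviour the report expects.
public class LongestMatchSketch {

    /** A dictionary hit inside the word, as [start, end) character offsets. */
    static final class Match {
        final int start;
        final int end;

        Match(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns the sub-tokens one would expect in addition to the original token.
     * Assumption: every offset is a candidate break point (no hyphenator involved).
     */
    static List<String> expectedSubwords(String word, Set<String> dictionary) {
        // Collect every dictionary entry that occurs somewhere in the word.
        List<Match> matches = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + 1; end <= word.length(); end++) {
                if (dictionary.contains(word.substring(start, end))) {
                    matches.add(new Match(start, end));
                }
            }
        }

        List<String> result = new ArrayList<>();
        for (Match m : matches) {
            // A match spanning the whole word would only duplicate the original token.
            boolean wholeWord = m.start == 0 && m.end == word.length();

            // A match lying inside a strictly longer match is not a longest match.
            boolean covered = false;
            for (Match other : matches) {
                boolean contains = other.start <= m.start && m.end <= other.end;
                boolean longer = (other.end - other.start) > (m.end - m.start);
                if (contains && longer) {
                    covered = true;
                    break;
                }
            }

            if (!wholeWord && !covered) {
                result.add(word.substring(m.start, m.end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("gesellschaft", "schaft"));
        // With the dictionary from this report no extra token is expected.
        System.out.println(expectedSubwords("gesellschaft", dict)); // prints []
    }
}
```

Under these assumptions the example from this issue yields no additional tokens, so only the original 'gesellschaft' token would remain in the stream.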

 


