Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2011/04/14 02:40:06 UTC

[jira] [Commented] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no effect

    [ https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019637#comment-13019637 ] 

Robert Muir commented on LUCENE-3022:
-------------------------------------

This sounds like a bug; do you want to try your hand at contributing a patch?

See http://wiki.apache.org/lucene-java/HowToContribute for some instructions.


> DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no effect
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3022
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3022
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4, 3.1
>            Reporter: Johann Höchtl
>            Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When using the DictionaryCompoundWordTokenFilter with a German dictionary, I ran into some strange behaviour:
> The German word "streifenbluse" (blouse with stripes) was decompounded to "streifen" (stripe), "reifen" (tire), which makes no sense at all.
> I thought the flag onlyLongestMatch would fix this, because "streifen" is longer than "reifen", but it had no effect (see the sketch after the proposed change below).
> So I reviewed the source code and found the problem:
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.length() < this.minWordSize) {
>       return;
>     }
>     
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
>     
>     for (int i=0;i<token.length()-this.minSubwordSize;++i) {
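>         // note: the longest match is tracked, and added to "tokens" below, separately for every start offset i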
>         Token longestMatchToken=null;
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.length()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.length()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             } 
>         }
>         if (this.onlyLongestMatch && longestMatchToken!=null) {
>           tokens.add(longestMatchToken);
>         }
>     }
>   }
> [/code]
> should be changed to 
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.termLength() < this.minWordSize) {
>       return;
>     }
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
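>     // now declared once before the loops: a single longest match is kept for the whole token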
>     Token longestMatchToken=null;
>     for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.termLength()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.termLength()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             }
>         }
>     }
>     if (this.onlyLongestMatch && longestMatchToken!=null) {
>         tokens.add(longestMatchToken);
>     }
>   }
> [/code]
> This way only the longest subword is actually indexed, and the onlyLongestMatch flag behaves as expected.
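> For reference, here is a minimal sketch of how the reported behaviour can be reproduced (assuming the Lucene 3.1 analyzers API; the class name, the tiny dictionary and the exact constructor arguments below are illustrative only, not taken from a real test):
> [code]
> import java.io.StringReader;
>
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.WhitespaceTokenizer;
> import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.util.Version;
>
> public class OnlyLongestMatchRepro {
>   public static void main(String[] args) throws Exception {
>     // Hypothetical mini-dictionary containing the overlapping subwords.
>     String[] dictionary = { "streifen", "reifen", "bluse" };
>
>     TokenStream ts = new DictionaryCompoundWordTokenFilter(
>         Version.LUCENE_31,
>         new WhitespaceTokenizer(Version.LUCENE_31, new StringReader("streifenbluse")),
>         dictionary,
>         5,    // minWordSize
>         2,    // minSubwordSize
>         15,   // maxSubwordSize
>         true  // onlyLongestMatch
>     );
>
>     CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
>     ts.reset();
>     while (ts.incrementToken()) {
>       // With the current code this still prints the overlapping matches
>       // "streifen" and "reifen", even though onlyLongestMatch is true.
>       System.out.println(term.toString());
>     }
>     ts.close();
>   }
> }
> [/code]
> With the change above, only the longest subword ("streifen") should survive for the whole token.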

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org