Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2014/03/16 05:49:16 UTC
[jira] [Updated] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no effect
[ https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley updated LUCENE-3022:
---------------------------------
Fix Version/s: 4.8 (was: 4.7)
> DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no effect
> ---------------------------------------------------------------------
>
> Key: LUCENE-3022
> URL: https://issues.apache.org/jira/browse/LUCENE-3022
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 2.9.4, 3.1
> Reporter: Johann Höchtl
> Assignee: Robert Muir
> Priority: Minor
> Labels: dead
> Fix For: 4.8
>
> Attachments: LUCENE-3022.patch, LUCENE-3022.patch
>
> Original Estimate: 5m
> Remaining Estimate: 5m
>
> When using the DictionaryCompoundWordTokenFilter with a German dictionary, I observed strange behaviour:
> The German word "streifenbluse" (blouse with stripes) was decompounded to "streifen" (stripe) and "reifen" (tire), which makes no sense at all.
> I thought the flag onlyLongestMatch would fix this, because "streifen" is longer than "reifen", but it had no effect.
> So I reviewed the source code and found the problem:
> [code]
>   protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.length() < this.minWordSize) {
>       return;
>     }
>
>     char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.buffer());
>
>     for (int i = 0; i < token.length() - this.minSubwordSize; ++i) {
>       Token longestMatchToken = null;
>       for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
>         if (i + j > token.length()) {
>           break;
>         }
>         if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
>           if (this.onlyLongestMatch) {
>             if (longestMatchToken != null) {
>               if (longestMatchToken.length() < j) {
>                 longestMatchToken = createToken(i, j, token);
>               }
>             } else {
>               longestMatchToken = createToken(i, j, token);
>             }
>           } else {
>             tokens.add(createToken(i, j, token));
>           }
>         }
>       }
>       if (this.onlyLongestMatch && longestMatchToken != null) {
>         tokens.add(longestMatchToken);
>       }
>     }
>   }
> [/code]
> should be changed to
> [code]
>   protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.termLength() < this.minWordSize) {
>       return;
>     }
>     char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.termBuffer());
>     Token longestMatchToken = null;
>     for (int i = 0; i < token.termLength() - this.minSubwordSize; ++i) {
>       for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
>         if (i + j > token.termLength()) {
>           break;
>         }
>         if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
>           if (this.onlyLongestMatch) {
>             if (longestMatchToken != null) {
>               if (longestMatchToken.termLength() < j) {
>                 longestMatchToken = createToken(i, j, token);
>               }
>             } else {
>               longestMatchToken = createToken(i, j, token);
>             }
>           } else {
>             tokens.add(createToken(i, j, token));
>           }
>         }
>       }
>     }
>     if (this.onlyLongestMatch && longestMatchToken != null) {
>       tokens.add(longestMatchToken);
>     }
>   }
> [/code]
> That way, only the longest token is actually indexed and the onlyLongestMatch flag has the intended effect.
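
The difference between the two variants quoted above can be reproduced outside Lucene. The following is only a minimal sketch under stated assumptions: the class name LongestMatchSketch, its two methods, the subword size limits, and the two-word dictionary are hypothetical stand-ins, and the loop bounds are simplified compared to the filter code. It contrasts keeping the longest match per start offset (the reported behaviour) with keeping a single longest match for the whole word (the proposed change).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LongestMatchSketch {
    // Hypothetical stand-ins for minSubwordSize / maxSubwordSize (illustrative values).
    static final int MIN_SUBWORD = 2;
    static final int MAX_SUBWORD = 15;

    /** Keeps the longest dictionary match for each start offset, like the reported code. */
    static List<String> longestPerStart(String word, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + MIN_SUBWORD <= word.length(); i++) {
            String longest = null; // reset for every start offset
            for (int len = MIN_SUBWORD; len <= MAX_SUBWORD && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)
                        && (longest == null || candidate.length() > longest.length())) {
                    longest = candidate;
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    /** Keeps one longest match over the whole word, like the proposed change. */
    static List<String> longestOverall(String word, Set<String> dict) {
        String longest = null; // shared across all start offsets
        for (int i = 0; i + MIN_SUBWORD <= word.length(); i++) {
            for (int len = MIN_SUBWORD; len <= MAX_SUBWORD && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)
                        && (longest == null || candidate.length() > longest.length())) {
                    longest = candidate;
                }
            }
        }
        return longest == null ? List.of() : List.of(longest);
    }

    public static void main(String[] args) {
        // Hypothetical dictionary containing only the two words from the report.
        Set<String> dict = Set.of("streifen", "reifen");
        System.out.println(longestPerStart("streifenbluse", dict)); // [streifen, reifen]
        System.out.println(longestOverall("streifenbluse", dict));  // [streifen]
    }
}
```

With this dictionary, the per-offset variant emits both "streifen" (start offset 0) and "reifen" (start offset 2), because each start offset gets its own longest match. The whole-word variant emits only "streifen".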