Posted to dev@lucene.apache.org by "Robert Muir (Updated) (JIRA)" <ji...@apache.org> on 2012/03/06 03:35:57 UTC
[jira] [Updated] (LUCENE-3022) DictionaryCompoundWordTokenFilter
Flag onlyLongestMatch has no affect
[ https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-3022:
--------------------------------
Fix Version/s: (was: 3.6)
> DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
> ---------------------------------------------------------------------
>
> Key: LUCENE-3022
> URL: https://issues.apache.org/jira/browse/LUCENE-3022
> Project: Lucene - Java
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 2.9.4, 3.1
> Reporter: Johann Höchtl
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-3022.patch, LUCENE-3022.patch
>
> Original Estimate: 5m
> Remaining Estimate: 5m
>
> When using the DictionaryCompoundWordTokenFilter with a German dictionary, I got strange behaviour:
> The German word "streifenbluse" (blouse with stripes) was decompounded to "streifen" (stripe) and "reifen" (tire), which makes no sense at all.
> I thought the flag onlyLongestMatch would fix this, because "streifen" is longer than "reifen", but it had no effect.
> So I reviewed the source code and found the problem:
> [code]
> protected void decomposeInternal(final Token token) {
>   // Only words longer than minWordSize get processed
>   if (token.length() < this.minWordSize) {
>     return;
>   }
>
>   char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
>
>   for (int i=0;i<token.length()-this.minSubwordSize;++i) {
>     Token longestMatchToken=null;
>     for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>       if(i+j>token.length()) {
>         break;
>       }
>       if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>         if (this.onlyLongestMatch) {
>           if (longestMatchToken!=null) {
>             if (longestMatchToken.length()<j) {
>               longestMatchToken=createToken(i,j,token);
>             }
>           } else {
>             longestMatchToken=createToken(i,j,token);
>           }
>         } else {
>           tokens.add(createToken(i,j,token));
>         }
>       }
>     }
>     if (this.onlyLongestMatch && longestMatchToken!=null) {
>       tokens.add(longestMatchToken);
>     }
>   }
> }
> [/code]
> should be changed to
> [code]
> protected void decomposeInternal(final Token token) {
>   // Only words longer than minWordSize get processed
>   if (token.termLength() < this.minWordSize) {
>     return;
>   }
>   char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
>   Token longestMatchToken=null;
>   for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
>     for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>       if(i+j>token.termLength()) {
>         break;
>       }
>       if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>         if (this.onlyLongestMatch) {
>           if (longestMatchToken!=null) {
>             if (longestMatchToken.termLength()<j) {
>               longestMatchToken=createToken(i,j,token);
>             }
>           } else {
>             longestMatchToken=createToken(i,j,token);
>           }
>         } else {
>           tokens.add(createToken(i,j,token));
>         }
>       }
>     }
>   }
>   if (this.onlyLongestMatch && longestMatchToken!=null) {
>     tokens.add(longestMatchToken);
>   }
> }
> [/code]
> so that only the longest token is actually indexed and the onlyLongestMatch flag has an effect.
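
The difference between the two loops quoted above can be reproduced without Lucene. The following is only a rough sketch, not the filter's actual code: the class name LongestMatchSketch, the two method names, the subword size limits and the three-word dictionary are all assumptions made for illustration. It contrasts tracking the longest match per start offset (as the first listing does) with tracking one longest match for the whole word (as the proposed change does).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; names and limits are not taken from Lucene.
public class LongestMatchSketch {
    static final int MIN_SUBWORD = 2;   // assumed minimum subword length
    static final int MAX_SUBWORD = 15;  // assumed maximum subword length

    // Longest match is tracked separately for every start offset,
    // mirroring the loop structure of the first listing.
    static List<String> perOffset(String word, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + MIN_SUBWORD <= word.length(); i++) {
            String longest = null; // reset for each start offset
            for (int len = MIN_SUBWORD; len <= MAX_SUBWORD && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)) {
                    longest = candidate; // len only grows, so the last hit is the longest
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    // One longest match is tracked for the whole word,
    // mirroring the loop structure of the proposed change.
    static List<String> wholeWord(String word, Set<String> dict) {
        String longest = null; // shared across all start offsets
        for (int i = 0; i + MIN_SUBWORD <= word.length(); i++) {
            for (int len = MIN_SUBWORD; len <= MAX_SUBWORD && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)
                        && (longest == null || candidate.length() > longest.length())) {
                    longest = candidate;
                }
            }
        }
        List<String> out = new ArrayList<>();
        if (longest != null) {
            out.add(longest);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed dictionary; the reporter's real dictionary is not shown in the issue.
        Set<String> dict = new HashSet<>(Arrays.asList("streifen", "reifen", "bluse"));
        System.out.println(perOffset("streifenbluse", dict)); // [streifen, reifen, bluse]
        System.out.println(wholeWord("streifenbluse", dict)); // [streifen]
    }
}
```

With this assumed dictionary, the per-offset variant still emits "reifen" because it starts at a different offset than "streifen", which matches the behaviour described in the report. The whole-word variant emits only "streifen", so it also drops "bluse"; whether that is the desired meaning of onlyLongestMatch is a separate question.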