Posted to solr-user@lucene.apache.org by Gregg Donovan <gr...@gmail.com> on 2009/09/14 22:52:46 UTC

Difficulty with Multi-Word Synonyms

I'm running into an odd issue with multi-word synonyms in Solr (using
the latest [9/14/09] nightly). Things generally seem to work as
expected, but I sometimes see the leading term of a multi-word synonym
being replaced by the token that follows it in the stream, when it
should just be left alone (i.e., there's no synonym match for that
token by itself). When I preview the analysis at admin/analysis.jsp it
looks fine, but at runtime I see problems like the one in the unit
test below. It's a simple case, so I assume I'm making some sort of
configuration and/or usage error.

package org.apache.solr.analysis;
import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class TestMultiWordSynonmys extends junit.framework.TestCase {

  public void testMultiWordSynonmys() throws IOException {
    List<String> rules = new ArrayList<String>();
    rules.add( "a b c,d" );
    SynonymMap synMap = new SynonymMap( true );
    SynonymFilterFactory.parseRules( rules, synMap, "=>", ",", true, null);

    SynonymFilter ts = new SynonymFilter( new WhitespaceTokenizer( new StringReader("a e")), synMap );
    TermAttribute termAtt = (TermAttribute) ts.getAttribute(TermAttribute.class);

    ts.reset();
    List<String> tokens = new ArrayList<String>();
    while (ts.incrementToken()) tokens.add( termAtt.term() );

    // This fails: the stream actually produces ["e","e"] -- the leading "a" is replaced
    assertEquals(Arrays.asList("a","e"), tokens);
  }
}
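
For reference, the pass-through semantics the test assumes can be sketched
without any Lucene dependency. The class and method names below are invented
for illustration; this is not Solr's SynonymFilter:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of multi-word synonym matching: a rule like
// "a b c => d" should rewrite only the exact sequence [a, b, c];
// a partial prefix such as "a" followed by "e" must pass through.
public class MultiWordSynonymSketch {

    public static List<String> apply(List<String> input, List<String> pattern,
                                     String replacement) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < input.size(); ) {
            // does the full pattern match starting at position i?
            if (i + pattern.size() <= input.size()
                    && input.subList(i, i + pattern.size()).equals(pattern)) {
                out.add(replacement);   // whole sequence matched: emit synonym
                i += pattern.size();
            } else {
                out.add(input.get(i));  // no full match: token passes through
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> pattern = Arrays.asList("a", "b", "c");
        // "a e": only a prefix of the rule matches, so nothing is rewritten
        System.out.println(apply(Arrays.asList("a", "e"), pattern, "d"));        // [a, e]
        // "a b c x": the full rule matches and collapses to "d"
        System.out.println(apply(Arrays.asList("a", "b", "c", "x"), pattern, "d")); // [d, x]
    }
}
```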

Any help would be much appreciated. Thanks.

--Gregg

Re: Difficulty with Multi-Word Synonyms

Posted by Robert Muir <rc...@gmail.com>.
Thank you again for the bug report with test case!

> Is there a recommended workaround that avoids combining the new and old
> APIs?

If you aren't able to patch Lucene, you could apply this workaround patch to
your Solr. It dodges the problem for your case by forcing the filter to use
only the next(Token) API.

Index: src/java/org/apache/solr/analysis/SynonymFilter.java
===================================================================
--- src/java/org/apache/solr/analysis/SynonymFilter.java    (revision 816467)
+++ src/java/org/apache/solr/analysis/SynonymFilter.java    (working copy)
@@ -179,7 +179,8 @@
     SynonymMap result = null;

     if (map.submap != null) {
-      Token tok = nextTok();
+      Token tok = new Token();
+      tok = nextTok(tok);
       if (tok != null) {
         // check for positionIncrement!=1?  if>1, should not match, if==0, check multiple at this level?
         SynonymMap subMap = map.submap.get(tok.termBuffer(), 0, tok.termLength());

-- 
Robert Muir
rcmuir@gmail.com

Re: Difficulty with Multi-Word Synonyms

Posted by Gregg Donovan <gr...@gmail.com>.
Thanks. And thanks for the help -- we're hoping to switch from query-time to
index-time synonym expansion for all of the reasons listed on the wiki
<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46>,
so this will be great to resolve.

I created SOLR-1445 <https://issues.apache.org/jira/browse/SOLR-1445>, though
the problem seems to be caused by LUCENE-1919
<https://issues.apache.org/jira/browse/LUCENE-1919>, as you noted.

Is there a recommended workaround that avoids combining the new and old
APIs? Would a version of SynonymFilter that also implemented
incrementToken() be helpful?
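
For what it's worth, the shape a pure incrementToken() stream takes can be
sketched without Lucene at all. Everything below is invented for illustration
(it is not the Lucene API): the point is that the new API exposes state
through a single reusable attribute object, so a consumer (or a filter
buffering lookahead tokens) must copy that state rather than hold a reference
across calls -- holding stale per-token state is plausibly how an "a" comes
back as ["e","e"]:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, Lucene-free mock of the attribute-based API.
class TermAttr { String term; }

abstract class Stream {
    final TermAttr termAtt = new TermAttr(); // shared, reused attribute
    abstract boolean incrementToken();       // advance; fill termAtt in place
}

class ListStream extends Stream {
    private final Iterator<String> it;
    ListStream(List<String> tokens) { it = tokens.iterator(); }
    boolean incrementToken() {
        if (!it.hasNext()) return false;
        termAtt.term = it.next();            // overwrite the shared attribute
        return true;
    }
}

public class IncrementTokenSketch {
    public static void main(String[] args) {
        Stream ts = new ListStream(Arrays.asList("a", "e"));
        List<String> out = new ArrayList<String>();
        // the consumer copies termAtt.term immediately; the attribute
        // object itself is overwritten by the next incrementToken() call
        while (ts.incrementToken()) out.add(ts.termAtt.term);
        System.out.println(out); // [a, e]
    }
}
```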

--Gregg

On Thu, Sep 17, 2009 at 7:38 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Thu, Sep 17, 2009 at 6:29 PM, Lance Norskog <go...@gmail.com> wrote:
> > Please add a Jira issue for this. It will get more attention there.
> >
> > BTW, thanks for creating such a precise bug report.
>
> +1
>
> Thanks, I had missed this.  This is serious, and looks due to a Lucene
> back compat break.
> I've added the testcase and can confirm the bug.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> > [original bug report quoted in full; snipped]

Re: Difficulty with Multi-Word Synonyms

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Sep 17, 2009 at 6:29 PM, Lance Norskog <go...@gmail.com> wrote:
> Please add a Jira issue for this. It will get more attention there.
>
> BTW, thanks for creating such a precise bug report.

+1

Thanks, I had missed this.  This is serious, and looks due to a Lucene
back compat break.
I've added the testcase and can confirm the bug.

-Yonik
http://www.lucidimagination.com



> [original bug report quoted in full; snipped]

Re: Difficulty with Multi-Word Synonyms

Posted by Lance Norskog <go...@gmail.com>.
Please add a Jira issue for this. It will get more attention there.

BTW, thanks for creating such a precise bug report.

On Mon, Sep 14, 2009 at 1:52 PM, Gregg Donovan <gr...@gmail.com> wrote:
> [original bug report quoted in full; snipped]



-- 
Lance Norskog
goksron@gmail.com