Posted to solr-user@lucene.apache.org by Brandon Fish <br...@gmail.com> on 2011/12/15 21:07:41 UTC

Is there an issue with hyphens in SpellChecker with StandardTokenizer?

I am getting an error using the SpellChecker component with the query
"another-test":

java.lang.StringIndexOutOfBoundsException: String index out of range: -7

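For what it's worth, the negative number in the message matches what String.substring produces when its end index is smaller than its begin index (the reported value is end minus begin). A minimal illustration of that failure mode, not the actual Solr code path:

public class OffsetDemo {
  public static void main(String[] args) {
    String query = "another-test"; // 12 characters
    // Inverted offsets make substring throw:
    // java.lang.StringIndexOutOfBoundsException: String index out of range: -7
    query.substring(12, 5);
  }
}
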
This appears to be related to this issue <https://issues.apache.org/jira/browse/SOLR-1630>, which has been marked as fixed. The configuration and test case below appear to reproduce the error I am seeing: both "another" and "test" come back as tokens with a start offset of 0 and an end offset of 12.
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

     &spellcheck=true&spellcheck.collate=true

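For reference, a complete request with these parameters would look something like this (hypothetical local URL; it assumes the spellcheck component is registered on the handler being queried):

     http://localhost:8983/solr/select?q=another-test&spellcheck=true&spellcheck.collate=true
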
Is this an issue with my configuration/test, or is there an issue with the
SpellingQueryConverter? Is there a recommended workaround, such as the
WhitespaceTokenizer mentioned in the issue comments?

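In case it helps to make that workaround concrete, this is the kind of field type I would try (a sketch only; the field type name and its placement in schema.xml are assumptions on my part):

      <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <!-- WhitespaceTokenizer does not split on hyphens -->
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>

Since WhitespaceTokenizer leaves "another-test" as a single token, the converter never has to assign offsets to sub-tokens.
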
Thank you for your help.

package org.apache.solr.spelling;

import static org.junit.Assert.assertTrue;

import java.util.Collection;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.solr.common.util.NamedList;
import org.junit.Test;

public class SimpleQueryConverterTest {

  @Test
  public void testSimpleQueryConversion() {
    SpellingQueryConverter converter = new SpellingQueryConverter();
    converter.init(new NamedList());
    converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
    String original = "another-test";
    Collection<Token> tokens = converter.convert(original);
    assertTrue("Token offsets do not match", isOffsetCorrect(original, tokens));
  }

  // Each token's (startOffset, endOffset) must point back at the matching
  // substring of the original query; otherwise collation will miscut it.
  private boolean isOffsetCorrect(String s, Collection<Token> tokens) {
    for (Token token : tokens) {
      int start = token.startOffset();
      int end = token.endOffset();
      if (!s.substring(start, end).equals(token.toString()))
        return false;
    }
    return true;
  }
}
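
To run just this test on its own, plain JUnit 4 works (classpath setup is whatever your checkout provides):

java org.junit.runner.JUnitCore org.apache.solr.spelling.SimpleQueryConverterTest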

RE: Is there an issue with hyphens in SpellChecker with StandardTokenizer?

Posted by Steven A Rowe <sa...@syr.edu>.
Brandon,

Looks like SOLR-2509 <https://issues.apache.org/jira/browse/SOLR-2509> fixed the problem - that's where OffsetAttribute was added (as you noted).

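For anyone following along, the gist of the fix is to take each token's offsets straight from the analyzer's OffsetAttribute instead of recomputing them from the query string. A rough sketch of that approach against the Lucene 3.x API (a hypothetical helper, not the actual patch):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collection;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetAwareConverter {
  // Copy each token's text and offsets directly from the stream's attributes,
  // so "another-test" yields "another" at [0,7) and "test" at [8,12).
  public static Collection<Token> convert(Analyzer analyzer, String original) throws IOException {
    Collection<Token> tokens = new ArrayList<Token>();
    TokenStream stream = analyzer.tokenStream("", new StringReader(original));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = stream.addAttribute(OffsetAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      tokens.add(new Token(termAtt.toString(), offsetAtt.startOffset(), offsetAtt.endOffset()));
    }
    stream.end();
    stream.close();
    return tokens;
  }
}
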
I ran my test method on branches/lucene_solr_3_5/, and I got the same failure there as you did, so I can confirm that Solr 3.5 has this bug and that it will be fixed in Solr 3.6.

Steve

Re: Is there an issue with hyphens in SpellChecker with StandardTokenizer?

Posted by Brandon Fish <br...@gmail.com>.
Yes, branch_3x works for me as well. The addition of the OffsetAttribute
probably corrected this issue. I will either switch to WhitespaceAnalyzer,
patch my distribution, or wait for 3.6 to resolve this.

Thanks.

Re: Is there an issue with hyphens in SpellChecker with StandardTokenizer?

Posted by Brandon Fish <br...@gmail.com>.
Hi Steve,

I was using branch 3.5. I will try this on the tip of branch_3x too.

Thanks.

RE: Is there an issue with hyphens in SpellChecker with StandardTokenizer?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Brandon,

When I add the following to SpellingQueryConverterTest.java on the tip of branch_3x (which will be released as Solr 3.6), the test succeeds:

@Test
public void testStandardAnalyzerWithHyphen() {
  SpellingQueryConverter converter = new SpellingQueryConverter();
  converter.init(new NamedList());
  converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
  String original = "another-test";
  Collection<Token> tokens = converter.convert(original);
  // StandardTokenizer splits on the hyphen: expect "another" [0,7) and "test" [8,12)
  assertTrue("tokens is null and it shouldn't be", tokens != null);
  assertEquals("tokens Size: " + tokens.size() + " is not 2", 2, tokens.size());
  assertTrue("Token offsets do not match", isOffsetCorrect(original, tokens));
}

What version of Solr/Lucene are you using?

Steve
