Posted to solr-user@lucene.apache.org by Brandon Fish <br...@gmail.com> on 2011/12/15 21:07:41 UTC
Is there an issue with hyphens in SpellChecker with StandardTokenizer?
I am getting an error using the SpellChecker component with the query
"another-test"
java.lang.StringIndexOutOfBoundsException: String index out of range: -7
This appears to be related to this issue <https://issues.apache.org/jira/browse/SOLR-1630>, which has been marked as fixed. The configuration and test case below appear to reproduce the error I am seeing: both "another" and "test" get turned into tokens with start and end offsets of 0 and 12.
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
&spellcheck=true&spellcheck.collate=true
Is this an issue with my configuration/test, or is there an issue with the SpellingQueryConverter? Is there a recommended workaround, such as the WhitespaceTokenizer mentioned in the issue comments?
Thank you for your help.
package org.apache.solr.spelling;

import static org.junit.Assert.assertTrue;

import java.util.Collection;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.solr.common.util.NamedList;
import org.junit.Test;

public class SimpleQueryConverterTest {
    @Test
    public void testSimpleQueryConversion() {
        SpellingQueryConverter converter = new SpellingQueryConverter();
        converter.init(new NamedList());
        converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
        String original = "another-test";
        Collection<Token> tokens = converter.convert(original);
        assertTrue("Token offsets do not match",
                isOffsetCorrect(original, tokens));
    }

    private boolean isOffsetCorrect(String s, Collection<Token> tokens) {
        for (Token token : tokens) {
            int start = token.startOffset();
            int end = token.endOffset();
            if (!s.substring(start, end).equals(token.toString()))
                return false;
        }
        return true;
    }
}
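The failure mode can be distilled without any Lucene dependency: if a token reports the span of the whole query (offsets 0 and 12 for "another-test") instead of its own span, the substring check above fails. A minimal, self-contained sketch (the Tok class is a made-up stand-in for Lucene's Token, not its real API):

```java
import java.util.Arrays;
import java.util.Collection;

public class OffsetCheckDemo {
    // Hypothetical stand-in for org.apache.lucene.analysis.Token.
    static final class Tok {
        final String text;
        final int start, end;
        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Same check as isOffsetCorrect() in the test above: a token's reported
    // span must cut exactly its own text out of the original query.
    static boolean isOffsetCorrect(String s, Collection<Tok> tokens) {
        for (Tok t : tokens) {
            if (!s.substring(t.start, t.end).equals(t.text))
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String original = "another-test";
        // Offsets as the buggy converter reports them: both tokens span 0..12.
        Collection<Tok> buggy = Arrays.asList(
                new Tok("another", 0, 12), new Tok("test", 0, 12));
        // Offsets as StandardTokenizer itself reports them.
        Collection<Tok> good = Arrays.asList(
                new Tok("another", 0, 7), new Tok("test", 8, 12));
        System.out.println(isOffsetCorrect(original, buggy)); // false
        System.out.println(isOffsetCorrect(original, good));  // true
    }
}
```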
RE: Is there an issue with hyphens in SpellChecker with StandardTokenizer?
Posted by Steven A Rowe <sa...@syr.edu>.
Brandon,
Looks like SOLR-2509 <https://issues.apache.org/jira/browse/SOLR-2509> fixed the problem - that's where OffsetAttribute was added (as you noted).
I ran my test method on branches/lucene_solr_3_5/ and got the same failure you did, so I can confirm that Solr 3.5 has this bug and that it will be fixed in Solr 3.6.
Steve
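The negative index in the reported exception is typical of what happens when later string surgery trusts these bogus offsets. The sketch below is illustrative only - it is not the actual SpellingQueryConverter or SpellCheckCollator code, and the collate() helper is invented - but it shows how two tokens that both claim the span 0..12 drive a replacement index negative once the first replacement shrinks the string:

```java
public class CollationSketch {
    // Illustrative collation: replace each token's reported span with its
    // correction, tracking how earlier replacements shift later offsets.
    // NOT the real Solr collator - just the shape of the arithmetic.
    static String collate(String query, int[][] spans, String[] corrections) {
        String result = query;
        int shift = 0;
        for (int i = 0; i < spans.length; i++) {
            int start = spans[i][0] + shift;
            int end = spans[i][1] + shift;
            // Throws StringIndexOutOfBoundsException if start went negative.
            result = result.substring(0, start) + corrections[i] + result.substring(end);
            shift += corrections[i].length() - (spans[i][1] - spans[i][0]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Correct offsets: collation succeeds.
        System.out.println(collate("another-test",
                new int[][] {{0, 7}, {8, 12}}, new String[] {"mother", "text"}));
        // Buggy offsets (both tokens claim 0..12): the first replacement
        // shrinks the string, so the second span starts at a negative index.
        try {
            collate("another-test",
                    new int[][] {{0, 12}, {0, 12}}, new String[] {"mother", "text"});
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("caught StringIndexOutOfBoundsException");
        }
    }
}
```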
Re: Is there an issue with hyphens in SpellChecker with StandardTokenizer?
Posted by Brandon Fish <br...@gmail.com>.
Yes, branch_3x works for me as well. The addition of the OffsetAttribute probably corrected this issue. I will either switch to WhitespaceAnalyzer, patch my distribution, or wait for 3.6 to resolve this.
Thanks.
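For context on the WhitespaceAnalyzer option: a whitespace-only tokenizer never splits on hyphens, so "another-test" stays a single token whose offsets trivially match the source string. A minimal sketch of that behavior (plain Java, not Lucene's WhitespaceTokenizer itself; tokens are emitted as hypothetical {text, start, end} triples):

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplitDemo {
    // Whitespace-only "tokenizer": splits on whitespace runs and records each
    // token's [start, end) offsets into the original string - the way a
    // whitespace tokenizer would, never breaking on hyphens.
    static List<String[]> tokenize(String s) {
        List<String[]> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
            int start = i;
            while (i < s.length() && !Character.isWhitespace(s.charAt(i))) i++;
            if (i > start)
                tokens.add(new String[] {
                        s.substring(start, i), String.valueOf(start), String.valueOf(i)});
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "another-test" survives as one token with consistent offsets [0,12).
        for (String[] t : tokenize("another-test here")) {
            System.out.println(t[0] + " [" + t[1] + "," + t[2] + ")");
        }
    }
}
```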
Re: Is there an issue with hyphens in SpellChecker with StandardTokenizer?
Posted by Brandon Fish <br...@gmail.com>.
Hi Steve,
I was using branch 3.5. I will try this on the tip of branch_3x too.
Thanks.
RE: Is there an issue with hyphens in SpellChecker with StandardTokenizer?
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Brandon,
When I add the following to SpellingQueryConverterTest.java on the tip of branch_3x (will be released as Solr 3.6), the test succeeds:
@Test
public void testStandardAnalyzerWithHyphen() {
    SpellingQueryConverter converter = new SpellingQueryConverter();
    converter.init(new NamedList());
    converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
    String original = "another-test";
    Collection<Token> tokens = converter.convert(original);
    assertTrue("tokens is null and it shouldn't be", tokens != null);
    assertEquals("tokens Size: " + tokens.size() + " is not 2", 2, tokens.size());
    assertTrue("Token offsets do not match", isOffsetCorrect(original, tokens));
}
What version of Solr/Lucene are you using?
Steve