Posted to solr-user@lucene.apache.org by Jason Rennie <jr...@gmail.com> on 2008/10/06 22:08:58 UTC

spellcheck: issues

I've noticed a few issues with spellcheck as I've been testing it out for
use on our site...

   1. Rebuild breaks requests - I'm using rebuildOnCommit ATM.  If a commit
   is going on and files are being rebuilt in the spellcheck data dir,
   spellcheck requests yield bogus answers.  I.e. I can issue identical
   requests and get drastically different answers.  The first time, I get
   suggestions and "correctlySpelled" is false.  The second time (during the
   commit), I get no suggestions and "correctlySpelled" is true.  Shouldn't
   spellcheck use the old index until the new one is ready for use, like solr
   does with optimizes?
   2. Inconsistent ordering - The first suggestion changes depending on the
   spellcheck.count that I specify.  If my query is "chanl" and I ask for one
   result, the suggestion is "chant" (freq. 16).  If I ask for 5 results, the
   first suggestion is also "chant"; the other 4 suggestions are less frequent
   (e.g. "chang", freq. 11).  However, if I ask for 10 results, the first
   suggestion is "chanel" (freq. 1296); #2 and #3 are "chant" and "chang"; #9
   is "chan" (freq. 174).  Shouldn't spellcheck return the best suggestion
   first?  In my case, shouldn't "chanel" always top "chant" and "chang" since
   they all have the same edit distance yet "chanel" is two orders of
   magnitude more popular?

Is there anything I could be doing wrong to create these problems?  If not,
are these known issues?  If not, should I create jira's for them?

Thanks,

Jason

spellcheck: issues

Posted by Jason Rennie <jr...@gmail.com>.
Hello, I've been exploring usage of the spellcheck feature via solr 1.3.  I
have it working, but there are some issues I'm seeing that make it less
useful than it could be.  Response on the solr-user mailing list has been
limited.  I'm guessing the reason may be that I'm asking about issues which
are most relevant to the lucene codebase.  So, I hope you don't mind this
cross-posting.

I've noticed a few issues with spellcheck as I've been testing it out for
use on our site...

   1. Rebuild breaks requests - I'm using rebuildOnCommit ATM.  If a commit
   is going on and files are being rebuilt in the spellcheck data dir,
   spellcheck requests yield bogus answers.  I.e. I can issue identical
   requests and get drastically different answers.  The first time, I get
   suggestions and "correctlySpelled" is false.  The second time (during the
   commit), I get no suggestions and "correctlySpelled" is true.  Shouldn't
   spellcheck use the old index until the new one is ready for use, like solr
   does with optimizes?
   2. Inconsistent ordering - The first suggestion changes depending on the
   spellcheck.count that I specify.  If my query is "chanl" and I ask for one
   result, the suggestion is "chant" (freq. 16).  If I ask for 5 results, the
   first suggestion is also "chant"; the other 4 suggestions are less frequent
   (#2 is "chang", freq. 11).  However, if I ask for 10 results, the first
   suggestion is "chanel" (freq. 1296); #2 and #3 are "chant" and "chang"; #9
   is "chan" (freq. 174).  Shouldn't spellcheck always return the best
   suggestion first?  In my case, shouldn't "chanel" always top "chant" and
   "chang" since they all have the same edit distance yet "chanel" is two
   orders of magnitude more popular?

Is there anything I could be doing wrong to create these problems?  If not,
are these known issues?  If not, should I create jira's for them?

Thanks,

Jason

Re: spellcheck: issues

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 6, 2008, at 4:08 PM, Jason Rennie wrote:

> I've noticed a few issues with spellcheck as I've been testing it  
> out for
> use on our site...
>
>   1. Rebuild breaks requests - I'm using rebuildOnCommit ATM.  If a  
> commit
>   is going on and files are being rebuilt in the spellcheck data dir,
>   spellcheck requests yield bogus answers.  I.e. I can issue identical
>   requests and get drastically different answers.  The first time, I  
> get
>   suggestions and "correctlySpelled" is false.  The second time  
> (during the
>   commit), I get no suggestions and "correctlySpelled" is true.   
> Shouldn't
>   spellcheck use the old index until the new one is ready for use,  
> like solr
>   does with optimizes?

Hmm, that sounds like a bug.

>
>   2. Inconsistent ordering - The first suggestion changes depending  
> on the
>   spellcheck.count that I specify.  If my query is "chanl" and I ask  
> for one
>   result, the suggestion is "chant" (freq. 16).  If I ask for 5  
> results, the
>   first suggestion is also "chant"; the other 4 suggestions are less  
> frequent
>   (e.g. "chang", freq. 11).  However, if I ask for 10 results, the  
> first
>   suggestion is "chanel" (freq. 1296); #2 and #3 are "chant" and  
> "chang"; #9
>   is "chan" (freq. 174).  Shouldn't spellcheck return the best  
> suggestion
>   first?  In my case, shouldn't "chanel" always top "chant" and  
> "chang" since
>   they all have the same edit distance yet "chanel" is two orders of
>   magnitude more popular?

Is there any way you can write up a small test case?  This definitely  
sounds like a bug.

>
>
> Is there anything I could be doing wrong to create these problems?

I suppose there is, but it doesn't sound like it.


> If not,
> are these known issues?  If not, should I create jira's for them?

Can you try to isolate it to a small repeatable example?

Thanks,
Grant

Re: spellcheck: issues

Posted by Jason Rennie <jr...@gmail.com>.
Ah, now I see.  Results are always sorted first by the edit distance, then
by the popularity.  What I think would work even better than allowing a
custom compareTo function would be to incorporate the frequency directly
into the distance function.  This would allow for greater control over the
trade-off between frequency and edit distance.  I'll file a jira and look at
submitting a patch.

Cheers,

Jason

On Thu, Oct 9, 2008 at 9:22 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Sorting in the SpellChecker is handled by the SuggestWord.compareTo()
> method in Lucene.  It looks like:
> public final int compareTo(SuggestWord a) {
>    // first criteria: the edit distance
>    if (score > a.score) {
>      return 1;
>    }
>    if (score < a.score) {
>      return -1;
>    }
>
>    // second criteria (if first criteria is equal): the popularity
>    if (freq > a.freq) {
>      return 1;
>    }
>
>    if (freq < a.freq) {
>      return -1;
>    }
>    return 0;
>  }
>
> I could see you opening a JIRA issue in Lucene against the SC to make it so
> that the sorting could be overridden/pluggable.  A patch to do so would be
> even better ;-)
>
> Cheers,
> Grant
>



-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/

Re: spellcheck: issues

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 8, 2008, at 6:20 PM, Jason Rennie wrote:

> On Wed, Oct 8, 2008 at 3:31 PM, Jason Rennie <jr...@gmail.com>  
> wrote:
>
>> I just tried J-W and *yes* it seems to do a much better job!  I'd  
>> certainly
>> vote for that becoming the default :)
>>
>
> Ack!  I did some more testing and J-W results started to get weird
> (including suggesting "courses" for "coursets" even though "corsets"  
> is 4x
> as frequent as "courses", and "nylo" for "nylom" even though "nylon"  
> is 200x
> more frequent than "nylo").  The default measure got these right.   
> Does J-W
> use frequency information at all?
>

Sorting in the SpellChecker is handled by the SuggestWord.compareTo()  
method in Lucene.  It looks like:
public final int compareTo(SuggestWord a) {
     // first criteria: the edit distance
     if (score > a.score) {
       return 1;
     }
     if (score < a.score) {
       return -1;
     }

     // second criteria (if first criteria is equal): the popularity
     if (freq > a.freq) {
       return 1;
     }

     if (freq < a.freq) {
       return -1;
     }
     return 0;
   }

I could see you opening a JIRA issue in Lucene against the SC to make  
it so that the sorting could be overridden/pluggable.  A patch to do  
so would be even better ;-)

Cheers,
Grant
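
To make the idea of an overridable sort order concrete, here is a minimal sketch. It does not use Lucene's SuggestWord or touch the SpellChecker; the Suggestion class, the two comparators, rank() and sample() are hypothetical names introduced only for illustration, and the scores are hand-picked values for a query of "chanl". Unlike the compareTo() above, both comparators here are written to put the best suggestion first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PluggableSuggestionOrder {

    // Hypothetical stand-in for a suggestion; NOT Lucene's SuggestWord.
    static final class Suggestion {
        final String word;
        final float score; // normalized similarity, higher means closer to the query
        final int freq;    // number of documents containing the word

        Suggestion(String word, float score, int freq) {
            this.word = word;
            this.score = score;
            this.freq = freq;
        }
    }

    // Same criteria as the quoted compareTo(): similarity first, frequency breaks ties.
    public static final Comparator<Suggestion> SCORE_THEN_FREQ =
            Comparator.comparingDouble((Suggestion s) -> s.score)
                      .thenComparingInt(s -> s.freq)
                      .reversed();

    // One possible alternative: frequency first, similarity breaks ties.
    public static final Comparator<Suggestion> FREQ_THEN_SCORE =
            Comparator.comparingInt((Suggestion s) -> s.freq)
                      .thenComparingDouble(s -> s.score)
                      .reversed();

    // Returns the suggested words ordered by the supplied comparator, best first.
    public static List<String> rank(List<Suggestion> suggestions, Comparator<Suggestion> order) {
        List<Suggestion> sorted = new ArrayList<>(suggestions);
        sorted.sort(order);
        List<String> words = new ArrayList<>();
        for (Suggestion s : sorted) {
            words.add(s.word);
        }
        return words;
    }

    // Hand-picked sample values (assumed, for illustration only).
    public static List<Suggestion> sample() {
        return Arrays.asList(
                new Suggestion("chand", 0.8f, 1),
                new Suggestion("chan", 0.75f, 106),
                new Suggestion("chanel", 0.8f, 834));
    }

    public static void main(String[] args) {
        System.out.println(rank(sample(), SCORE_THEN_FREQ));
        System.out.println(rank(sample(), FREQ_THEN_SCORE));
    }
}
```

With the sample values, the score-first order places "chand" ahead of "chan", while the frequency-first order swaps them. Passing the comparator in as an argument is the part that corresponds to "pluggable" sorting; how that would be wired into the SpellChecker itself is left open here.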

Re: spellcheck: issues

Posted by Jason Rennie <jr...@gmail.com>.
On Wed, Oct 8, 2008 at 3:31 PM, Jason Rennie <jr...@gmail.com> wrote:

> I just tried J-W and *yes* it seems to do a much better job!  I'd certainly
> vote for that becoming the default :)
>

Ack!  I did some more testing and J-W results started to get weird
(including suggesting "courses" for "coursets" even though "corsets" is 4x
as frequent as "courses", and "nylo" for "nylom" even though "nylon" is 200x
more frequent than "nylo").  The default measure got these right.  Does J-W
use frequency information at all?

Jason

P.S. I can see why you're offering a 3rd party solution---this is not an
easy problem to solve!

Re: spellcheck: issues

Posted by Jason Rennie <jr...@gmail.com>.
On Wed, Oct 8, 2008 at 3:05 PM, Grant Ingersoll <gs...@apache.org> wrote:

> chane is in the dictionary.  For better or worse, Lucene skips words that
> are in the dictionary when OMP is false.


Ah, I see.  I think we'll use OMP=true, which seems like a reasonable
setting anyway.


> Makes sense to me.  I could see the Spellchecker being modified (in Lucene)
> to provide alternate scoring/sorting.  Right now, you can use other distance
> measures, as well, so you could codify your idea and try it out to see if it
> is better (and then donate it!)
> You might try the Jaro-Winkler measure, too, as it is a bit more
> sophisticated than Levenshtein when it comes to scoring.
>

I just tried J-W and *yes* it seems to do a much better job!  I'd certainly
vote for that becoming the default :)

Thanks for all the help!  Much appreciated.

Jason

-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/

Re: spellcheck: issues

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 8, 2008, at 2:03 PM, Jason Rennie wrote:

> On Wed, Oct 8, 2008 at 1:24 PM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>
>> Token: chane OMP: false
>> Oct 8, 2008 1:19:56 PM org.apache.solr.core.SolrCore execute
>> INFO: [spell] webapp=null path=/select
>> params={q=description%3Achane&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
>> hits=1 status=0 QTime=1
>> No Suggestions
>
>
> The result here seems wrong to me.  Shouldn't it suggest "chanel"?   
> You also
> tried this same query with OMP=true and it suggested "chanel".   
> Maybe I'm
> not understanding the purpose of OMP?  Shouldn't OMP=false return at  
> least
> as many suggestions as OMP=true?

chane is in the dictionary.  For better or worse, Lucene skips words  
that are in the dictionary when OMP is false.
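
The behavior described above can be sketched roughly as follows. This is not Lucene's code: the suggest() method, the frequency map and the pre-computed candidate list are all hypothetical, and the sketch only models the two rules discussed in this thread (a word already in the index gets no suggestions when OMP is false; when OMP is true, only candidates more frequent than the query word are kept).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OnlyMorePopularSketch {

    // Looks up a word's document frequency, treating unknown words as 0.
    static int freqOf(Map<String, Integer> freqs, String word) {
        Integer f = freqs.get(word);
        return f == null ? 0 : f;
    }

    // Hypothetical model of the onlyMorePopular (OMP) rules; candidates are
    // assumed to be close-enough words that some distance measure already found.
    public static List<String> suggest(String word, boolean onlyMorePopular,
                                       Map<String, Integer> freqs, List<String> candidates) {
        int wordFreq = freqOf(freqs, word);
        List<String> result = new ArrayList<>();
        if (!onlyMorePopular && wordFreq > 0) {
            // The word exists in the index, so it is treated as correctly spelled.
            return result;
        }
        for (String candidate : candidates) {
            if (candidate.equals(word)) {
                continue; // never suggest the query word itself
            }
            if (onlyMorePopular && freqOf(freqs, candidate) <= wordFreq) {
                continue; // keep only candidates more frequent than the query word
            }
            result.add(candidate);
        }
        return result;
    }

    // Frequencies taken from the test case in this thread.
    public static Map<String, Integer> sampleFreqs() {
        Map<String, Integer> freqs = new HashMap<>();
        freqs.put("chanel", 834);
        freqs.put("chane", 1);
        return freqs;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("chanel");
        System.out.println(suggest("chane", false, sampleFreqs(), candidates));
        System.out.println(suggest("chane", true, sampleFreqs(), candidates));
        System.out.println(suggest("chanl", false, sampleFreqs(), candidates));
    }
}
```

Under these assumptions, "chane" with OMP=false yields nothing, "chane" with OMP=true yields "chanel", and "chanl" with OMP=false yields "chanel", which matches the three corresponding results in the test output.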


>
>
> Token: chanl OMP: false
>> Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
>> INFO: [spell] webapp=null path=/select
>> params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=10}
>> hits=0 status=0 QTime=2
>>       Sugg[0]: [chanel, chant, chang, chani, chana, chane, charl,  
>> chand,
>> chan, chair]
>>       Sugg[0] Freqs: [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]
>>       Num Found 10
>>
>> ------
>>
>> 1)  Is this an accurate representation of what you are trying to  
>> convey?
>
>
> Yes.
>
> 2)  In light of this shared code that I hope captures both the  
> document side
>> and the query side, is the issue then highlighted by the last  
>> result above,
>> namely, that "chan" sorts after "chand" even though "chan" has a  
>> higher
>> frequency?
>
>
> I highlighted another issue above, but yes, the fact that "chan"  
> sorts below
> other single-edit terms with much lower frequencies seems like an  
> issue to
> me.  The Lucene SpellChecker page suggests a logical explanation:  
> terms are
> first sorted by the FuzzyQuery score (normalized edit distance),  
> then by
> popularity.  I'm wondering whether it would be better to sort by a  
> single,
> combined score, such as:
>
> NewSPerScore = (edit distance) * (suggestion term length) /  
> (original term
> length) + log_1000(frequency)
>
> Sorting according to this score would encourage longer suggestions,  
> but not
> at the expense of shorter, popular suggestion.  Might need to be  
> tweaked
> further, but I'd guess that it would do better than the two-step sort.
>

Makes sense to me.  I could see the Spellchecker being modified (in  
Lucene) to provide alternate scoring/sorting.  Right now, you can use  
other distance measures, as well, so you could codify your idea and  
try it out to see if it is better (and then donate it!)
You might try the Jaro-Winkler measure, too, as it is a bit more  
sophisticated than Levenshtein when it comes to scoring.


Re: spellcheck: issues

Posted by Jason Rennie <jr...@gmail.com>.
On Wed, Oct 8, 2008 at 1:24 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Token: chane OMP: false
> Oct 8, 2008 1:19:56 PM org.apache.solr.core.SolrCore execute
> INFO: [spell] webapp=null path=/select
> params={q=description%3Achane&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
> hits=1 status=0 QTime=1
> No Suggestions


The result here seems wrong to me.  Shouldn't it suggest "chanel"?  You also
tried this same query with OMP=true and it suggested "chanel".  Maybe I'm
not understanding the purpose of OMP?  Shouldn't OMP=false return at least
as many suggestions as OMP=true?

Token: chanl OMP: false
> Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
> INFO: [spell] webapp=null path=/select
> params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=10}
> hits=0 status=0 QTime=2
>        Sugg[0]: [chanel, chant, chang, chani, chana, chane, charl, chand,
> chan, chair]
>        Sugg[0] Freqs: [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]
>        Num Found 10
>
> ------
>
> 1)  Is this an accurate representation of what you are trying to convey?


Yes.

2)  In light of this shared code that I hope captures both the document side
> and the query side, is the issue then highlighted by the last result above,
> namely, that "chan" sorts after "chand" even though "chan" has a higher
> frequency?


I highlighted another issue above, but yes, the fact that "chan" sorts below
other single-edit terms with much lower frequencies seems like an issue to
me.  The Lucene SpellChecker page suggests a logical explanation: terms are
first sorted by the FuzzyQuery score (normalized edit distance), then by
popularity.  I'm wondering whether it would be better to sort by a single,
combined score, such as:

NewSPerScore = (edit distance) * (suggestion term length) / (original term
length) + log_1000(frequency)

Sorting according to this score would encourage longer suggestions, but not
at the expense of shorter, popular suggestion.  Might need to be tweaked
further, but I'd guess that it would do better than the two-step sort.

Cheers,

Jason
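
A minimal sketch of the combined score proposed above, to make the trade-off easier to experiment with. Two things here are assumptions rather than anything stated in the thread: "edit distance" is read as the normalized similarity (0 to 1, higher is better), and the similarity inputs in main() are hand-computed as 1 - edits / (shorter word length). The class and method names are hypothetical.

```java
class CombinedSpellScore {

    // NewSPerScore = similarity * (suggestion length / original length) + log_1000(frequency).
    // Assumes frequency >= 1; log_1000 is computed via natural logs.
    public static double combined(double similarity, int suggestionLength,
                                  int originalLength, int frequency) {
        double lengthRatio = (double) suggestionLength / originalLength;
        double logFrequency = Math.log(frequency) / Math.log(1000.0);
        return similarity * lengthRatio + logFrequency;
    }

    public static void main(String[] args) {
        int queryLength = "chanl".length();
        // Frequencies come from the test case; similarities are assumed inputs.
        System.out.println("chanel " + combined(0.80, 6, queryLength, 834));
        System.out.println("chair  " + combined(0.60, 5, queryLength, 1950));
        System.out.println("chan   " + combined(0.75, 4, queryLength, 106));
        System.out.println("chant  " + combined(0.80, 5, queryLength, 10));
        System.out.println("chand  " + combined(0.80, 5, queryLength, 1));
    }
}
```

With these inputs "chanel" scores highest and "chan" moves ahead of the low-frequency single-edit terms, as intended. Note, though, that "chair" (two edits away but very frequent) also lands above "chant", which is the kind of side effect that suggests the frequency term would need to be weighted down or capped.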

Re: spellcheck: issues

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Jason,

Here's what I did:

1. Took your code and modified it to be that of [1] below
2. Set up your config, schema, etc. as per the EmbeddedSolrServer
paths in the code (a Maven-like dir structure w/
src/main/resources/solr/spell containing your configuration).
3. Ran the code.  My output is:
--------------------
Token: chanel OMP: false
Oct 8, 2008 1:19:56 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achanel&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
hits=834 status=0 QTime=46
No Suggestions
--------------------
Token: chane OMP: false
Oct 8, 2008 1:19:56 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achane&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
hits=1 status=0 QTime=1
No Suggestions
--------------------
Token: chane OMP: true
Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achane&spellcheck=true&spellcheck.onlyMorePopular=true&spellcheck.extendedResults=true&spellcheck.count=1}
hits=1 status=0 QTime=15
	Sugg[0]: [chanel]
	Sugg[0] Freqs: [834]
	Num Found 1
--------------------
Token: chanl OMP: false
Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
hits=0 status=0 QTime=2
	Sugg[0]: [chanel]
	Sugg[0] Freqs: [834]
	Num Found 1
--------------------
Token: chanl OMP: false
Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=5}
hits=0 status=0 QTime=2
	Sugg[0]: [chanel, chant, chang, chani, chane]
	Sugg[0] Freqs: [834, 10, 8, 4, 1]
	Num Found 5
--------------------
Token: chanl OMP: false
Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=10}
hits=0 status=0 QTime=2
	Sugg[0]: [chanel, chant, chang, chani, chana, chane, charl, chand, chan, chair]
	Sugg[0] Freqs: [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]
	Num Found 10

------

1)  Is this an accurate representation of what you are trying to convey?
2)  In light of this shared code that I hope captures both the  
document side and the query side, is the issue then highlighted by the  
last result above, namely, that "chan" sorts after "chand" even though  
"chan" has a higher frequency?

Thanks,
Grant


[1]
package com.grantingersoll.noodles;

import junit.framework.TestCase;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.params.SpellingParams;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.CoreDescriptor;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.component.SpellCheckComponent;

import java.util.ArrayList;
import java.util.List;
import java.util.Collection;
import java.util.HashSet;
import java.io.File;


/**
  *
  *
  **/
public class SpellCheckingTest extends TestCase {


   public void testSpelling() throws Exception {
     List<Pair<String, Integer>> terms = new ArrayList<Pair<String, Integer>>();
     terms.add(new Pair<String, Integer>("chanel", 834));
     terms.add(new Pair<String, Integer>("chant", 10));
     terms.add(new Pair<String, Integer>("chang", 8));
     terms.add(new Pair<String, Integer>("chani", 4));
     terms.add(new Pair<String, Integer>("chand", 1));
     terms.add(new Pair<String, Integer>("chana", 1));
     terms.add(new Pair<String, Integer>("charl", 1));
     terms.add(new Pair<String, Integer>("chane", 1));
     terms.add(new Pair<String, Integer>("chan", 106));
     terms.add(new Pair<String, Integer>("chair", 1950));
     int id = 0;
     CoreContainer container = new CoreContainer("src/main/resources/solr", new File("src/main/resources/solr/solr.xml"));
     //container.load();
     //SolrCore core = container.create(descriptor);
     final SolrServer client = new EmbeddedSolrServer(container, "spell");
     //client.setParser(new XMLResponseParser());
     Collection<SolrInputDocument> docs = new HashSet<SolrInputDocument>();
     for (Pair<String, Integer> term : terms) {
       final int freq = term.getSecond().intValue();
       for (int i = 0; i < freq; ++i) {
         SolrInputDocument doc = new SolrInputDocument();
         doc.addField("id", String.valueOf(++id));
         doc.addField("description", term.getFirst());
         docs.add(doc);
       }
     }
     client.add(docs);
     client.optimize();

     //buildSpellCheck(client);

     spellCheck(client, "chanel", false, 1);
     spellCheck(client, "chane", false, 1);
     spellCheck(client, "chane", true, 1);
     spellCheck(client, "chanl", false, 1);
     spellCheck(client, "chanl", false, 5);
     spellCheck(client, "chanl", false, 10);

   }

   private void spellCheck(SolrServer client, String token, boolean omp, int numSuggs) throws SolrServerException {
     System.out.println("--------------------");
     System.out.println("Token: " + token + " OMP: " + omp);
     SolrQuery query;
     QueryResponse rsp;
     SpellCheckResponse spRsp;
     query = new SolrQuery("description:" + token);
     query.set(SpellCheckComponent.COMPONENT_NAME, "true");
     query.set(SpellingParams.SPELLCHECK_ONLY_MORE_POPULAR, String.valueOf(omp));
     query.set(SpellingParams.SPELLCHECK_EXTENDED_RESULTS, "true");
     query.set(SpellingParams.SPELLCHECK_COUNT, String.valueOf(numSuggs));
     //query.setQueryType("dismax");
     rsp = client.query(query);
     spRsp = rsp.getSpellCheckResponse();

     //System.out.println("Response: " + rsp);
     List<SpellCheckResponse.Suggestion> suggestions = spRsp.getSuggestions();
     //System.out.println("Spelling: " + suggestions);
     printSuggestions(suggestions);
   }

   private void printSuggestions(List<SpellCheckResponse.Suggestion> suggestions) {
     int i = 0;
     if (!suggestions.isEmpty()) {
       for (SpellCheckResponse.Suggestion sugg : suggestions) {
         System.out.println("\tSugg[" + i + "]: " + sugg.getSuggestions());
         System.out.println("\tSugg[" + i + "] Freqs: " + sugg.getSuggestionFrequencies());
         System.out.println("\tNum Found " + sugg.getNumFound());
         ++i; // advance the index so each misspelled token gets its own label
       }
     } else {
       System.out.println("No Suggestions");
     }
   }


}

class Pair<S, T> {

   S first;

   T second;

   public Pair(S _first, T _second) {
     this.first = _first;
     this.second = _second;
   }

   public S getFirst() {
     return this.first;
   }

   public T getSecond() {
     return this.second;

   }

}


On Oct 8, 2008, at 10:22 AM, Jason Rennie wrote:

> Hi Grant,
>
> Here are solr config files (attached) and java code (included below)  
> to recreate the test case.
>
> Jason
>
>         List<Pair<String, Integer>> terms = new ArrayList<Pair<String, Integer>>();
>         terms.add(new Pair<String, Integer>("chanel", 834));
>         terms.add(new Pair<String, Integer>("chant", 10));
>         terms.add(new Pair<String, Integer>("chang", 8));
>         terms.add(new Pair<String, Integer>("chani", 4));
>         terms.add(new Pair<String, Integer>("chand", 1));
>         terms.add(new Pair<String, Integer>("chana", 1));
>         terms.add(new Pair<String, Integer>("charl", 1));
>         terms.add(new Pair<String, Integer>("chane", 1));
>         terms.add(new Pair<String, Integer>("chan", 106));
>         terms.add(new Pair<String, Integer>("chair", 1950));
>         int id = 0;
>         final CommonsHttpSolrServer client = new CommonsHttpSolrServer("http://solr:8080/solr/");
>         client.setParser(new XMLResponseParser());
>         for (Pair<String, Integer> term : terms) {
>             final int freq = term.getSecond().intValue();
>             for (int i = 0; i < freq; ++i) {
>                 SolrInputDocument doc = new SolrInputDocument();
>                 doc.addField("id", String.valueOf(++id));
>                 doc.addField("description", term.getFirst());
>                 client.add(doc);
>             }
>         }
>         client.optimize();
>
> Here's a Pair class:
>
> public class Pair<S, T> {
>
>     S first;
>
>     T second;
>
>     public Pair(S _first, T _second) {
>         this.first = _first;
>         this.second = _second;
>     }
>
>     public S getFirst() {
>         return this.first;
>     }
>
>     public T getSecond() {
>         return this.second;
>
>     }
>
> }
>
> <solrconfig.xml><schema.xml>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









Re: spellcheck: issues

Posted by Jason Rennie <jr...@gmail.com>.
Hi Grant,

Here are solr config files (attached) and java code (included below) to
recreate the test case.

Jason

        List<Pair<String, Integer>> terms = new ArrayList<Pair<String, Integer>>();
        terms.add(new Pair<String, Integer>("chanel", 834));
        terms.add(new Pair<String, Integer>("chant", 10));
        terms.add(new Pair<String, Integer>("chang", 8));
        terms.add(new Pair<String, Integer>("chani", 4));
        terms.add(new Pair<String, Integer>("chand", 1));
        terms.add(new Pair<String, Integer>("chana", 1));
        terms.add(new Pair<String, Integer>("charl", 1));
        terms.add(new Pair<String, Integer>("chane", 1));
        terms.add(new Pair<String, Integer>("chan", 106));
        terms.add(new Pair<String, Integer>("chair", 1950));
        int id = 0;
        final CommonsHttpSolrServer client = new CommonsHttpSolrServer("http://solr:8080/solr/");
        client.setParser(new XMLResponseParser());
        for (Pair<String, Integer> term : terms) {
            final int freq = term.getSecond().intValue();
            for (int i = 0; i < freq; ++i) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", String.valueOf(++id));
                doc.addField("description", term.getFirst());
                client.add(doc);
            }
        }
        client.optimize();

Here's a Pair class:

public class Pair<S, T> {

    S first;

    T second;

    public Pair(S _first, T _second) {
        this.first = _first;
        this.second = _second;
    }

    public S getFirst() {
        return this.first;
    }

    public T getSecond() {
        return this.second;
    }

}

Re: spellcheck: issues

Posted by Jason Rennie <jr...@gmail.com>.
Sure.  I just sent the relevant files/code directly to you.  Let me know if
you don't get them or have any trouble with them.

Jason

On Tue, Oct 7, 2008 at 3:27 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Can you share your spellchecker setup and the code for the test case?  I
> would like to reproduce it and see what's going on.
>
>
>
>
> On Oct 7, 2008, at 2:18 PM, Jason Rennie wrote:
>
>  On Tue, Oct 7, 2008 at 11:56 AM, Grant Ingersoll <gsingers@apache.org
>> >wrote:
>>
>>  Is there any way you can write up a small test case?  This definitely
>>> sounds
>>> like a bug.
>>>
>>
>>
>> I tried adding single word documents according to the top ten suggestions
>> and frequencies for "chanl".  I.e. I created a fresh index, then added 834
>> "chanel" docs; 10 "chant" docs; 8 "chang" docs; 4 "chani" docs; 1 doc each
>> of "chand", "chana", "charl" and "chane"; 106 docs of "chan"; and 1950
>> docs
>> of "chair".  The fact that "chan" would come after the single-freq terms
>> seems wrong to me.
>>
>> I'm guessing the "FuzzyQuery score" (
>> http://wiki.apache.org/jakarta-lucene/SpellChecker) may be the reason for
>> some of the weird results I'm seeing.  Based on what I've seen and also
>> according to the SpellChecker wiki, it sounds like ordering is done first
>> by
>> this FuzzyQuery score ((edit distance)/(length of word)), then by
>> popularity.  This seems to explain "chan" coming after "chand" (above),
>> "candyâ" coming before "candy" and "yell" coming before "yello".
>>
>> On Tue, Oct 7, 2008 at 11:59 AM, Grant Ingersoll <gsingers@apache.org
>> >wrote:
>>
>>  Again, probably b/c of the distance.  What distance measure are you
>>> using?
>>>
>>
>>
>> I'm not specifying a distance measure.
>>
>>
>>  No, it should run in both cases.  Can you reproduce in a small test case?
>>>
>>
>>
>> In this test case I created, I searched for "chane" (with spellcheck=true)
>> and got one result.  When I searched for "chanel", it returned
>> numFound="834".  I have "accuracy" set to 0.5.  Should the spellchecker
>> not
>> suggest "chanel" for the "chane" query?
>>
>> Jason
>>
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>
>


-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

Re: spellcheck: issues

Posted by Grant Ingersoll <gs...@apache.org>.
Can you share your spellchecker setup and the code for the test case?   
I would like to reproduce it and see what's going on.



On Oct 7, 2008, at 2:18 PM, Jason Rennie wrote:

> On Tue, Oct 7, 2008 at 11:56 AM, Grant Ingersoll  
> <gs...@apache.org>wrote:
>
>> Is there any way you can write up a small test case?  This  
>> definitely sounds
>> like a bug.
>
>
> I tried adding single word documents according to the top ten  
> suggestions
> and frequencies for "chanl".  I.e. I created a fresh index, then  
> added 834
> "chanel" docs; 10 "chant" docs; 8 "chang" docs; 4 "chani" docs; 1  
> doc each
> of "chand", "chana", "charl" and "chane"; 106 docs of "chan"; and  
> 1950 docs
> of "chair".  The fact that "chan" would come after the single-freq  
> terms
> seems wrong to me.
>
> I'm guessing the "FuzzyQuery score" (
> http://wiki.apache.org/jakarta-lucene/SpellChecker) may be the  
> reason for
> some of the weird results I'm seeing.  Based on what I've seen and  
> also
> according to the SpellChecker wiki, it sounds like ordering is done  
> first by
> this FuzzyQuery score ((edit distance)/(length of word)), then by
> popularity.  This seems to explain "chan" coming after  
> "chand" (above),
> "candyâ" coming before "candy" and "yell" coming before "yello".
>
> On Tue, Oct 7, 2008 at 11:59 AM, Grant Ingersoll  
> <gs...@apache.org>wrote:
>
>> Again, probably b/c of the distance.  What distance measure are you  
>> using?
>
>
> I'm not specifying a distance measure.
>
>
>> No, it should run in both cases.  Can you reproduce in a small test  
>> case?
>
>
> In this test case I created, I searched for "chane" (with  
> spellcheck=true)
> and got one result.  When I searched for "chanel", it returned
> numFound="834".  I have "accuracy" set to 0.5.  Should the  
> spellchecker not
> suggest "chanel" for the "chane" query?
>
> Jason

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









Re: spellcheck: issues

Posted by Jason Rennie <jr...@gmail.com>.
On Tue, Oct 7, 2008 at 11:56 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Is there any way you can write up a small test case?  This definitely sounds
> like a bug.


I tried adding single word documents according to the top ten suggestions
and frequencies for "chanl".  I.e. I created a fresh index, then added 834
"chanel" docs; 10 "chant" docs; 8 "chang" docs; 4 "chani" docs; 1 doc each
of "chand", "chana", "charl" and "chane"; 106 docs of "chan"; and 1950 docs
of "chair".  The fact that "chan" would come after the single-freq terms
seems wrong to me.

I'm guessing the "FuzzyQuery score" (
http://wiki.apache.org/jakarta-lucene/SpellChecker) may be the reason for
some of the weird results I'm seeing.  Based on what I've seen and also
according to the SpellChecker wiki, it sounds like ordering is done first by
this FuzzyQuery score ((edit distance)/(length of word)), then by
popularity.  This seems to explain "chan" coming after "chand" (above),
"candyâ" coming before "candy" and "yell" coming after "yello".

On Tue, Oct 7, 2008 at 11:59 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Again, probably b/c of the distance.  What distance measure are you using?


I'm not specifying a distance measure.


> No, it should run in both cases.  Can you reproduce in a small test case?


In this test case I created, I searched for "chane" (with spellcheck=true)
and got one result.  When I searched for "chanel", it returned
numFound="834".  I have "accuracy" set to 0.5.  Should the spellchecker not
suggest "chanel" for the "chane" query?
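
The ordering guess above (score first, then popularity) can be checked
against the test index with a short sketch.  It assumes the score is
1 - (edit distance)/(length of the shorter word), with frequency as the
tie-breaker and "accuracy" as a minimum score.  That formula is an
assumption chosen because it reproduces the orderings reported in this
thread; it is not confirmed from the Lucene source, and the function names
are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def assumed_score(query: str, candidate: str) -> float:
    # ASSUMPTION: normalise by the shorter of the two words.
    shorter = min(len(query), len(candidate))
    return 1.0 - edit_distance(query, candidate) / shorter


def rank(query: str, freqs: dict, accuracy: float = 0.5) -> list:
    # ASSUMPTION: the query itself is never suggested, candidates below
    # the accuracy cutoff are dropped, and ties are broken by frequency.
    kept = [w for w in freqs
            if w != query and assumed_score(query, w) >= accuracy]
    return sorted(kept, key=lambda w: (-assumed_score(query, w), -freqs[w]))


# Term frequencies from the test index described in this message.
test_index = {
    "chanel": 834, "chant": 10, "chang": 8, "chani": 4,
    "chand": 1, "chana": 1, "charl": 1, "chane": 1,
    "chan": 106, "chair": 1950,
}

print(rank("chanl", test_index))
print(rank("chane", test_index))
```

Under that assumption "chan" scores 0.75 for "chanl" while every
five-letter neighbour scores 0.8, so it sorts behind the single-frequency
terms despite its 106 docs.  For "chane", "chanel" scores 0.8, which is
above the 0.5 cutoff, so the cutoff alone would not explain the missing
suggestion.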

Jason

Re: spellcheck: issues

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 6, 2008, at 6:10 PM, Jason Rennie wrote:

> I've been using spellcheck.count=10 since that seems to yield a much better
> top result than using the default count of 1.  However, I'm still seeing
> weird cases.  Here are a few queries with returned suggestions.  Frequency
> counts are in parentheses.
>
>   - query is "candyz".  Suggestions are: 1. "candyâ" (1), 2. "candy" (965),
>   ...  #2 is vastly more popular than #1 and involves the same # of edits.
>   Why would it order suggestions this way?

I'm guessing the edit distance is less????


>
>   - query is "yellw".  Suggestions are: 1. "yellow" (2880), 2. "yello" (2),
>   3. "yelow" (1), 4. "yell" (74), ...  Shouldn't "yell" come before "yello"
>   and "yelow" due to the higher frequency?

Again, probably b/c of the distance.  What distance measure are you  
using?

>
>   - query is "yello".  53 document hits.  No suggestions.  "yellow" yields
>   36560 documents.  Does the spellchecker only run when there are no document
>   hits?

No, it should run in both cases.  Can you reproduce in a small test  
case?

>
>
> Btw, is there a better place to be posting comments/questions like this?

Possibly, but here's the place to start.  They may be Lucene SC  
issues, but let's diagnose here, first, and then move to there if  
needed.


>
>
> Jason
>
> On Mon, Oct 6, 2008 at 4:08 PM, Jason Rennie <jr...@gmail.com> wrote:
>
>> I've noticed a few issues with spellcheck as I've been testing it out for
>> use on our site...
>>
>>   1. Rebuild breaks requests - I'm using rebuildOnCommit ATM.  If a
>>   commit is going on and files are being rebuilt in the spellcheck data dir,
>>   spellcheck requests yield bogus answers.  I.e. I can issue identical
>>   requests and get drastically different answers.  The first time, I get
>>   suggestions and "correctlySpelled" is false.  The second time (during the
>>   commit), I get no suggestions and "correctlySpelled" is true.  Shouldn't
>>   spellcheck use the old index until the new one is ready for use, like solr
>>   does with optimizes?
>>   2. Inconsistent ordering - The first suggestion changes depending on
>>   the spellcheck.count that I specify.  If my query is "chanl" and I ask for
>>   one result, the suggestion is "chant" (freq. 16).  If I ask for 5 results,
>>   the first suggestion is also "chant"; the other 4 suggestions are less
>>   frequent (e.g. "chang", freq. 11).  However, if I ask for 10 results, the
>>   first suggestion is "chanel" (freq. 1296); #2 and #3 are "chant" and
>>   "chang"; #9 is "chan" (freq. 174).  Shouldn't spellcheck return the best
>>   suggestion first?  In my case, shouldn't "chanel" always top "chant" and
>>   "chang" since they all have the same edit distance yet "chanel" is two
>>   orders of magnitude more popular?
>>
>> Is there anything I could be doing wrong to create these problems?  If not,
>> are these known issues?  If not, should I create JIRA issues for them?
>>
>> Thanks,
>>
>> Jason
>>
>>
>
>
> -- 
> Jason Rennie
> Head of Machine Learning Technologies, StyleFeeder
> http://www.stylefeeder.com/
> Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: spellcheck: issues

Posted by Jason Rennie <jr...@gmail.com>.
I've been using spellcheck.count=10 since that seems to yield a much better
top result than using the default count of 1.  However, I'm still seeing
weird cases.  Here are a few queries with returned suggestions.  Frequency
counts are in parentheses.

   - query is "candyz".  Suggestions are: 1. "candyâ" (1), 2. "candy" (965),
   ...  #2 is vastly more popular than #1 and involves the same # of edits.
   Why would it order suggestions this way?
   - query is "yellw".  Suggestions are: 1. "yellow" (2880), 2. "yello" (2),
   3. "yelow" (1), 4. "yell" (74), ...  Shouldn't "yell" come before "yello"
   and "yelow" due to the higher frequency?
   - query is "yello".  53 document hits.  No suggestions.  "yellow" yields
   36560 documents.  Does the spellchecker only run when there are no document
   hits?
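
To make cases like these easy to reproduce, the requests can be built as in
the sketch below.  The host, port and handler path are placeholders
(assumptions, adjust to the local setup) and the helper name is
hypothetical.  spellcheck, spellcheck.count and spellcheck.extendedResults
are Solr spellcheck request parameters; the last one asks for frequency
information alongside each suggestion.

```python
from urllib.parse import urlencode


def spellcheck_url(base_url: str, query: str, count: int = 10) -> str:
    """Build a Solr query URL that also asks for spelling suggestions."""
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.count": count,             # number of suggestions wanted
        "spellcheck.extendedResults": "true",  # include frequency details
    }
    return base_url + "?" + urlencode(params)


# ASSUMPTION: placeholder endpoint, not taken from this thread.
BASE = "http://localhost:8983/solr/select"

for word in ("candyz", "yellw", "yello"):
    print(spellcheck_url(BASE, word))
```

Issuing the same query with different count values (1, 5, 10) is a quick
way to see whether the top suggestion changes.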

Btw, is there a better place to be posting comments/questions like this?

Jason

On Mon, Oct 6, 2008 at 4:08 PM, Jason Rennie <jr...@gmail.com> wrote:

> I've noticed a few issues with spellcheck as I've been testing it out for
> use on our site...
>
>    1. Rebuild breaks requests - I'm using rebuildOnCommit ATM.  If a
>    commit is going on and files are being rebuilt in the spellcheck data dir,
>    spellcheck requests yield bogus answers.  I.e. I can issue identical
>    requests and get drastically different answers.  The first time, I get
>    suggestions and "correctlySpelled" is false.  The second time (during the
>    commit), I get no suggestions and "correctlySpelled" is true.  Shouldn't
>    spellcheck use the old index until the new one is ready for use, like solr
>    does with optimizes?
>    2. Inconsistent ordering - The first suggestion changes depending on
>    the spellcheck.count that I specify.  If my query is "chanl" and I ask for
>    one result, the suggestion is "chant" (freq. 16).  If I ask for 5 results,
>    the first suggestion is also "chant"; the other 4 suggestions are less
>    frequent (e.g. "chang", freq. 11).  However, if I ask for 10 results, the
>    first suggestion is "chanel" (freq. 1296); #2 and #3 are "chant" and
>    "chang"; #9 is "chan" (freq. 174).  Shouldn't spellcheck return the best
>    suggestion first?  In my case, shouldn't "chanel" always top "chant" and
>    "chang" since they all have the same edit distance yet "chanel" is two
>    orders of magnitude more popular?
>
> Is there anything I could be doing wrong to create these problems?  If not,
> are these known issues?  If not, should I create JIRA issues for them?
>
> Thanks,
>
> Jason
>
>


-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/