Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/06/06 12:21:58 UTC

issues with spellcheck.maxCollationTries and spellcheck.collateExtendedResults

Hi,

We've had some issues with a bad zero-hits collation being returned for a two-word query where one word was only one edit away from the required collation. With spellcheck.maxCollations set to a reasonable number we saw various suggestions, but not the required collation. We decreased thresholdTokenFrequency to make it appear in the list of collations. However, with collateExtendedResults=true the hits field for each collation was zero, which is incorrect.
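
For reference, this is roughly the set of spellcheck parameters we send with the query (the exact values below are illustrative, not our production settings):

      spellcheck=true
      spellcheck.collate=true
      spellcheck.maxCollations=5
      spellcheck.collateExtendedResults=true
      spellcheck.maxCollationTries=0
      q=huup stapel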

Required collation=huub stapel (two hits) and q=huup stapel

      "collation":{
        "collationQuery":"heup stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"heup"}},
      "collation":{
        "collationQuery":"hugo stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"hugo"}},
      "collation":{
        "collationQuery":"hulp stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"hulp"}},
      "collation":{
        "collationQuery":"hup stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"hup"}},
      "collation":{
        "collationQuery":"huub stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"huub"}},
      "collation":{
        "collationQuery":"huur stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"huur"}}}}}

Now, with maxCollationTries set to 3 or higher we finally get the required collation, which is also the only collation able to return results. How can we determine the best value for maxCollationTries in relation to how far we decrease thresholdTokenFrequency? And why is hits always zero?

This is with today's build and with distributed search enabled.

Thanks,
Markus

RE: issues with spellcheck.maxCollationTries and spellcheck.collateExtendedResults

Posted by Markus Jelsma <ma...@openindex.io>.
Hello!

-----Original message-----
> From:Dyer, James <Ja...@ingrambook.com>
> Sent: Wed 06-Jun-2012 17:23
> To: solr-user@lucene.apache.org
> Subject: RE: issues with spellcheck.maxCollationTries and spellcheck.collateExtendedResults
> 
> Markus,
> 
> With "maxCollationTries=0", it is not going out and querying the collations to see how many hits they each produce.  So it doesn't know the # of hits.  That is why if you also specify "collateExtendedResults=true", all the hit counts are zero.  It would probably be better in this case if it would not report "hits" in the extended response at all.  (On the other hand, if you're seeing zeros and "maxCollationTries>0", then you've hit a bug!)

I see. It would indeed make sense to get rid of the hits field when it's always zero anyway with maxCollationTries=0. Despite your recent explanations it still causes some confusion.

> 
> "thresholdTokenFrequency" in my opinion is a pretty blunt instrument for getting rid of bad suggestions.  It takes out all of the rare terms, presuming that if a term is rare in the data it either is a mistake or isn't worthy to be suggested ever.  But if you're using "maxCollationTries" the suggestions that don't fit will be filtered out automatically, making "thresholdTokenFrequency" to be needed less.  (On the other hand, if you're using IndexBasedSpellChecker, "thresholdTokenFrequency" will make the dictionary smaller and "spellcheck.build" run faster...  This is solved entirely in 4.0 with DirectSolrSpellChecker...) 

I forgot to mention this is with the DirectSolrSpellChecker. I guess we'll just have to try working with thresholdTokenFrequency. It's difficult, however, because the index will grow and chances are that at some point a rare, but correct, token drops below the threshold and is no longer suggested. We also see the benefit of the threshold since our index is human-edited and contains rare but misspelled words.
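
Our spellchecker definition looks roughly like the following sketch (field name and values are placeholders rather than our exact config):

      <lst name="spellchecker">
        <str name="name">direct</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
        <str name="field">content</str>
        <str name="distanceMeasure">internal</str>
        <float name="accuracy">0.5</float>
        <int name="maxEdits">2</int>
        <int name="minPrefix">1</int>
        <!-- lowered so that rare but correct tokens such as "huub" remain suggestable -->
        <float name="thresholdTokenFrequency">.0001</float>
      </lst>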

> 
> For the apps here, I've been using "maxCollationTries=10" and have been getting good results.  Keep in mind that even though you're allowing it to try up to 10 queries to find a viable collation, as long as you set "maxCollations" to something low it will (hopefully) seldom need to try more than a couple before finding one with hits.  (I always ask for only 1 collation as we just re-apply the spelling correction automatically if the original query returned nothing.)  Also, if "spellcheck.count" is low it might not have enough terms available to try, so you might need to raise this value as well when raising "maxCollationTries".

We have a similar set-up and require only one collation to be returned. I can increase maxCollationTries.

> 
> The worst problem, in my opinion, is the fact that it won't ever suggest corrections for a word that is already in the index (even if you used "thresholdTokenFrequency" to remove it from the dictionary).  For that there is https://issues.apache.org/jira/browse/SOLR-2585 which is part of Solr 4.  The only other workaround is "onlyMorePopular", which has its own issues (see http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount).

We don't really like onlyMorePopular since more hits is not always a better suggestion; we decided to turn it off quite some time ago, also because of SOLR-2555. alternativeTermCount may indeed be a solution.

Thanks, we'll manage for now.

RE: issues with spellcheck.maxCollationTries and spellcheck.collateExtendedResults

Posted by "Dyer, James" <Ja...@ingrambook.com>.
Markus,

With "maxCollationTries=0", it is not going out and querying the collations to see how many hits they each produce.  So it doesn't know the # of hits.  That is why if you also specify "collateExtendedResults=true", all the hit counts are zero.  It would probably be better in this case if it would not report "hits" in the extended response at all.  (On the other hand, if you're seeing zeros and "maxCollationTries>0", then you've hit a bug!)

"thresholdTokenFrequency" in my opinion is a pretty blunt instrument for getting rid of bad suggestions.  It takes out all of the rare terms, presuming that if a term is rare in the data it either is a mistake or isn't worthy to be suggested ever.  But if you're using "maxCollationTries" the suggestions that don't fit will be filtered out automatically, making "thresholdTokenFrequency" to be needed less.  (On the other hand, if you're using IndexBasedSpellChecker, "thresholdTokenFrequency" will make the dictionary smaller and "spellcheck.build" run faster...  This is solved entirely in 4.0 with DirectSolrSpellChecker...) 

For the apps here, I've been using "maxCollationTries=10" and have been getting good results.  Keep in mind that even though you're allowing it to try up to 10 queries to find a viable collation, as long as you set "maxCollations" to something low it will (hopefully) seldom need to try more than a couple before finding one with hits.  (I always ask for only 1 collation as we just re-apply the spelling correction automatically if the original query returned nothing.)  Also, if "spellcheck.count" is low it might not have enough terms available to try, so you might need to raise this value as well when raising "maxCollationTries".
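
As a rough sketch of the kind of request-handler defaults I mean (values illustrative, not our production config):

      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.maxCollations">1</str>
        <str name="spellcheck.maxCollationTries">10</str>
        <str name="spellcheck.count">10</str>
      </lst>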

The worst problem, in my opinion, is the fact that it won't ever suggest corrections for a word that is already in the index (even if you used "thresholdTokenFrequency" to remove it from the dictionary).  For that there is https://issues.apache.org/jira/browse/SOLR-2585 which is part of Solr 4.  The only other workaround is "onlyMorePopular", which has its own issues (see http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount).
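
With that in place the request would gain something like the following (the value is just an example); it tells the spellchecker to also return up to that many suggestions for query terms that do exist in the index:

      spellcheck.alternativeTermCount=5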

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


RE: issues with spellcheck.maxCollationTries and spellcheck.collateExtendedResults

Posted by Markus Jelsma <ma...@openindex.io>.
Hi

The search is distributed over all shards. The problem exists locally as well.

Thanks,
 
-----Original message-----
> From:Jack Krupansky <ja...@basetechnology.com>
> Sent: Wed 06-Jun-2012 17:07
> To: solr-user@lucene.apache.org
> Subject: Re: issues with spellcheck.maxCollationTries and spellcheck.collateExtendedResults
> 
> Do single-word queries return hits?
> 
> Is this a multi-shard environment? Does the request list all the shards 
> needed to give hits for all the collations you expect? Maybe the queries are 
> being run locally and there are no hits for those collations locally.
> 
> -- Jack Krupansky

Re: issues with spellcheck.maxCollationTries and spellcheck.collateExtendedResults

Posted by Jack Krupansky <ja...@basetechnology.com>.
Do single-word queries return hits?

Is this a multi-shard environment? Does the request list all the shards 
needed to give hits for all the collations you expect? Maybe the queries are 
being run locally and there are no hits for those collations locally.
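
For instance, a distributed request would need something along these lines (host names and the handler path are placeholders):

      http://host1:8983/solr/select?q=huup+stapel&spellcheck=true
          &shards=host1:8983/solr,host2:8983/solr
          &shards.qt=/select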

-- Jack Krupansky
