Posted to solr-user@lucene.apache.org by "Dyer, James" <Ja...@ingramcontent.com> on 2015/01/09 21:22:43 UTC

RE: can't make sense of spellchecker results when using techproducts example

Chris,

- DirectSpellChecker has a "minPrefix" setting, which the techproducts example sets to 1 (also the default), so it will never try to correct the first character.  I think this is both a performance optimization and based on the assumption that we rarely misspell the first character.  This is why it will not correct "hell" to "dell".  I think it will allow you to set this to 0, if you want your sample query to work.

- The "maxCollationTries" feature re-writes "q" / "spellcheck.q" and then, using all the other parameters, queries internally to see if there are any hits.  This doesn't play very well with "q.op=OR" / "mm=1".  So when you see a collation like "here ultrasharp" / "heat ..." etc., it is indeed getting some hits, and so it is considered a valid query re-write, despite the absurdity.  We could improve this example config by adding "spellcheck.collateParam.q.op=AND" to the defaults.  (When using dismax, you would add "spellcheck.collateParam.mm=100%".)  Also, while the "collateParam" functionality is in the old Solr wiki, it doesn't seem to be in the reference manual, so we should probably add it, as this would be pretty important for a lot of users.

- Unless using the legacy IndexBasedSpellChecker / FileBasedSpellChecker, you need not use "spellcheck.build".  It's a no-op for both Direct and WordBreak, as these do not use sidecar indexes.
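
To make the first two points concrete, here is a toy Python sketch. It is not Lucene's or Solr's implementation: the helper names (shares_prefix, count_hits) and the three-document index are hypothetical and exist only for illustration. It shows why a candidate that changes the first character is never considered when minPrefix is 1, and why an internal collation check run with OR semantics can report hits for a re-write that no single document fully matches.

```python
def shares_prefix(term, candidate, min_prefix=1):
    """True if candidate keeps the first min_prefix characters of term."""
    return candidate[:min_prefix] == term[:min_prefix]


def matches(doc_terms, query_terms, op="OR"):
    """Toy boolean match: OR needs any query term, AND needs all of them."""
    present = [t in doc_terms for t in query_terms]
    return all(present) if op == "AND" else any(present)


def count_hits(docs, query, op="OR"):
    """Count documents matching a whitespace-separated query."""
    terms = query.split()
    return sum(matches(doc, terms, op) for doc in docs)


# Hypothetical three-document index, NOT the real techproducts data.
docs = [
    {"dell", "ultrasharp", "monitor"},
    {"here", "is", "a", "cable"},
    {"heat", "sink"},
]

# With min_prefix=1, "dell" is ruled out as a correction for "hell".
print(shares_prefix("hell", "hello"))                # True
print(shares_prefix("hell", "dell"))                 # False
print(shares_prefix("hell", "dell", min_prefix=0))   # True

# Under OR the absurd re-write still "gets hits"; under AND it does not.
print(count_hits(docs, "here ultrasharp", "OR"))     # 2
print(count_hits(docs, "here ultrasharp", "AND"))    # 0
print(count_hits(docs, "dell ultrasharp", "AND"))    # 1
```

In the toy index, "here ultrasharp" counts two hits under OR because one document contains "here" and another contains "ultrasharp", which mirrors the collations in the original report.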

So without changing the config, these queries illustrate the spellchecker pretty well, including the word-break functionality.

http://localhost:8983/solr/techproducts/spell?spellcheck.q=dzll+ultra%20sharp&df=text&spellcheck=true&spellcheck.collateParam.q.op=AND
http://localhost:8983/solr/techproducts/spell?spellcheck.q=dellultrasharp&df=text&spellcheck=true&spellcheck.collateParam.q.op=AND
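
If you would rather build these requests from a script, the following is a minimal sketch using only the Python standard library. The function name spell_url and its "extra" argument are made up for this example; the parameters are the ones in the two URLs above. Spaces are encoded as "+" rather than "%20", which is equivalent inside a query string.

```python
from urllib.parse import urlencode

# Handler URL taken from the examples above (local techproducts example).
BASE = "http://localhost:8983/solr/techproducts/spell"


def spell_url(spellcheck_q, extra=None):
    """Build a /spell request that verifies collations with q.op=AND."""
    params = {
        "spellcheck.q": spellcheck_q,
        "df": "text",
        "spellcheck": "true",
        # Applied only to the internal collation test queries.
        "spellcheck.collateParam.q.op": "AND",
    }
    if extra:
        params.update(extra)
    return BASE + "?" + urlencode(params)


print(spell_url("dzll ultra sharp"))
print(spell_url("dellultrasharp"))
```

For a dismax-style handler, the same helper could be called with extra={"spellcheck.collateParam.mm": "100%"} instead of relying on q.op.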

Spellcheck has a lot of gotchas, and I wish we could dream up a way to make it easy for people.  I remember it being a struggle for me when I was a new user, and I know we get lots of questions on the user-list about it.

My apologies to you for not answering this sooner.

James Dyer
Ingram Content Group


-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Wednesday, December 17, 2014 6:49 PM
To: solr-user@lucene.apache.org
Subject: can't make sense of spellchecker results when using techproducts example


Ok, so i've been working on updating the ref guide to account for the new 
way to run the "examples" in 5.0.

The spell checking page...

 	https://cwiki.apache.org/confluence/display/solr/Spell+Checking

...has some examples that loosely correlate to the "techproducts" 
example, but even if you ignore the specifics of those examples, i need 
help understanding the basic behavior of the spellchecker as configured in 
the techproducts example.

Assuming you run this...

 	bin/solr -e techproducts

...with that example running & those docs indexed, this URL gives me 
results i can't explain...

http://localhost:8983/solr/techproducts/spell?spellcheck.q=hell+ultrashar&df=text&spellcheck=true&spellcheck.build=true

(see below)

1) "dell" is not listed as a possible suggestion for "hell" (even if 
the dictionary thinks "hold" is a better suggestion, why isn't "dell" even 
included in the list of possibilities?)

2) in the "collation" section, i can't make any sense of what these 
results mean -- how is "hello ultrasharp" a suggested collationQuery when 
*none* of the example docs contain both "hello" and "ultrasharp" ?

http://localhost:8983/solr/techproducts/select?df=text&q=%2Bhello+%2Bultrasharp


So WTF is up with these spell check results?


<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">15</int>
</lst>
<str name="command">build</str>
<result name="response" numFound="0" start="0">
</result>
<lst name="spellcheck">
   <lst name="suggestions">
     <lst name="hell">
       <int name="numFound">6</int>
       <int name="startOffset">0</int>
       <int name="endOffset">4</int>
       <int name="origFreq">0</int>
       <arr name="suggestion">
         <lst>
           <str name="word">hello</str>
           <int name="freq">1</int>
         </lst>
         <lst>
           <str name="word">here</str>
           <int name="freq">2</int>
         </lst>
         <lst>
           <str name="word">heat</str>
           <int name="freq">1</int>
         </lst>
         <lst>
           <str name="word">hold</str>
           <int name="freq">1</int>
         </lst>
         <lst>
           <str name="word">html</str>
           <int name="freq">1</int>
         </lst>
         <lst>
           <str name="word">héllo</str>
           <int name="freq">1</int>
         </lst>
       </arr>
     </lst>
     <lst name="ultrashar">
       <int name="numFound">1</int>
       <int name="startOffset">5</int>
       <int name="endOffset">14</int>
       <int name="origFreq">0</int>
       <arr name="suggestion">
         <lst>
           <str name="word">ultrasharp</str>
           <int name="freq">1</int>
         </lst>
       </arr>
     </lst>
   </lst>
   <bool name="correctlySpelled">false</bool>
   <lst name="collations">
     <lst name="collation">
       <str name="collationQuery">hello ultrasharp</str>
       <int name="hits">2</int>
       <lst name="misspellingsAndCorrections">
         <str name="hell">hello</str>
         <str name="ultrashar">ultrasharp</str>
       </lst>
     </lst>
     <lst name="collation">
       <str name="collationQuery">here ultrasharp</str>
       <int name="hits">3</int>
       <lst name="misspellingsAndCorrections">
         <str name="hell">here</str>
         <str name="ultrashar">ultrasharp</str>
       </lst>
     </lst>
     <lst name="collation">
       <str name="collationQuery">heat ultrasharp</str>
       <int name="hits">2</int>
       <lst name="misspellingsAndCorrections">
         <str name="hell">heat</str>
         <str name="ultrashar">ultrasharp</str>
       </lst>
     </lst>
     <lst name="collation">
       <str name="collationQuery">hold ultrasharp</str>
       <int name="hits">2</int>
       <lst name="misspellingsAndCorrections">
         <str name="hell">hold</str>
         <str name="ultrashar">ultrasharp</str>
       </lst>
     </lst>
     <lst name="collation">
       <str name="collationQuery">html ultrasharp</str>
       <int name="hits">2</int>
       <lst name="misspellingsAndCorrections">
         <str name="hell">html</str>
         <str name="ultrashar">ultrasharp</str>
       </lst>
     </lst>
   </lst>
</lst>
</response>
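
For anyone who wants to inspect a response like this from a script, here is a small Python sketch that pulls the suggestions and collations out of the XML. It assumes the response layout shown above; the function name summarize and the trimmed SAMPLE string are illustrative only, and other Solr versions or response writers may structure the spellcheck section differently.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above, kept short for illustration.
SAMPLE = """<response>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <arr name="suggestion">
        <lst><str name="word">hello</str><int name="freq">1</int></lst>
        <lst><str name="word">here</str><int name="freq">2</int></lst>
      </arr>
    </lst>
  </lst>
  <lst name="collations">
    <lst name="collation">
      <str name="collationQuery">hello ultrasharp</str>
      <int name="hits">2</int>
    </lst>
  </lst>
</lst>
</response>"""


def summarize(xml_text):
    """Return ({term: [(word, freq), ...]}, [(collationQuery, hits), ...])."""
    root = ET.fromstring(xml_text)
    spell = root.find("lst[@name='spellcheck']")
    suggestions = {}
    for term in spell.findall("lst[@name='suggestions']/lst"):
        suggestions[term.get("name")] = [
            (s.findtext("str[@name='word']"), int(s.findtext("int[@name='freq']")))
            for s in term.findall("arr[@name='suggestion']/lst")
        ]
    collations = [
        (c.findtext("str[@name='collationQuery']"), int(c.findtext("int[@name='hits']")))
        for c in spell.findall("lst[@name='collations']/lst[@name='collation']")
    ]
    return suggestions, collations


suggestions, collations = summarize(SAMPLE)
print(suggestions)
print(collations)
```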






-Hoss
http://www.lucidworks.com/


RE: can't make sense of spellchecker results when using techproducts example

Posted by Chris Hostetter <ho...@fucit.org>.
James: everything you said made perfect sense, and in hindsight was 
actually covered on the page -- it was just the example that was bogus in 
light of the current config & defaults.

I went ahead and fixed it based on your feedback, and beefed up the 
explanation of spellcheck.collateParam.* (now it's part of the table 
instead of just a one-off sentence out of context).

https://cwiki.apache.org/confluence/display/solr/Spell+Checking
https://cwiki.apache.org/confluence/pages/diffpages.action?pageId=32604254&originalId=50859120

thanks!



-Hoss
http://www.lucidworks.com/