Posted to solr-user@lucene.apache.org by tom <de...@gmx.net> on 2012/03/22 09:57:19 UTC

possible spellcheck bug in 3.5 causing erroneous suggestions

hi folks,

i think i found a bug in the spellchecker but am not quite sure:
this is the query i send to solr:

http://lh:8983/solr/CompleteIndex/select?
&rows=0
&echoParams=all
&spellcheck=true
&spellcheck.onlyMorePopular=true
&spellcheck.extendedResults=no
&q=a+bb+ccc++dddd

and this is the result:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">4</int>
<lst name="params">
<str name="echoParams">all</str>
<str name="spellcheck">true</str>
<str name="echoParams">all</str>
<str name="spellcheck.extendedResults">no</str>
<str name="q">a bb ccc dddd</str>
<str name="rows">0</str>
<str name="spellcheck.onlyMorePopular">true</str>
</lst>
</lst>
<result name="response" numFound="43" start="0" />
<lst name="spellcheck">
<lst name="suggestions">
<lst name="bb">
<int name="numFound">1</int>
<int name="startOffset">2</int>
<int name="endOffset">4</int>
<arr name="suggestion">
<str>abb</str>
</arr>
</lst>
<lst name="cccc1">
<int name="numFound">1</int>
<int name="startOffset">5</int>
<int name="endOffset">8</int>
<arr name="suggestion">
<str>ccc</str>
</arr>
</lst>
<lst name="cccc2">
<int name="numFound">1</int>
<int name="startOffset">5</int>
<int name="endOffset">8</int>
<arr name="suggestion">
<str>ccc</str>
</arr>
</lst>
<lst name="dddd">
<int name="numFound">1</int>
<int name="startOffset">10</int>
<int name="endOffset">14</int>
<arr name="suggestion">
<str>dvd</str>
</arr>
</lst>
</lst>
</lst>
</response>

now, i know this is just a technical query. i ran it for a test regarding
suggestions and discovered the oddity just by chance; it is unrelated to
the test itself.
my question is how the suggestions cccc1 and cccc2 come about. from what
i understand from the wiki, the entries in spellcheck/suggestions are only
(misspelled) substrings from the user query.
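
The mismatch can be checked mechanically. Below is a minimal Python sketch, assuming the query reached Solr as "a bb ccc  dddd" (two spaces, from the "++" in the URL) and using a trimmed copy of the suggestions element from the response above. check_offsets is a hypothetical helper, not part of Solr; it slices the query at each reported offset pair and compares the result with the reported token.

```python
import xml.etree.ElementTree as ET

# Assumption: "+" decodes to a space, so "ccc++dddd" leaves two spaces.
QUERY = "a bb ccc  dddd"

# Trimmed copy of <lst name="suggestions"> from the response (offsets only).
SUGGESTIONS_XML = """
<lst name="suggestions">
  <lst name="bb">
    <int name="startOffset">2</int>
    <int name="endOffset">4</int>
  </lst>
  <lst name="cccc1">
    <int name="startOffset">5</int>
    <int name="endOffset">8</int>
  </lst>
  <lst name="cccc2">
    <int name="startOffset">5</int>
    <int name="endOffset">8</int>
  </lst>
  <lst name="dddd">
    <int name="startOffset">10</int>
    <int name="endOffset">14</int>
  </lst>
</lst>
"""


def check_offsets(query, suggestions_xml):
    """Return (reported_token, query_substring, matches) per suggestion entry."""
    root = ET.fromstring(suggestions_xml)
    rows = []
    for entry in root.findall("lst"):
        start = int(entry.find("int[@name='startOffset']").text)
        end = int(entry.find("int[@name='endOffset']").text)
        substring = query[start:end]
        token = entry.get("name")
        rows.append((token, substring, token == substring))
    return rows


if __name__ == "__main__":
    for token, substring, matches in check_offsets(QUERY, SUGGESTIONS_XML):
        print(f"{token!r} covers {substring!r} in the query, match={matches}")
```

For bb and dddd the reported token equals the substring at its offsets; for cccc1 and cccc2 the offsets 5-8 point at "ccc", which is the oddity in question.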

the setup/context is thus:
- the word ccc exists 11 times in the index, but cccc1 and cccc2 don't:

http://lh:8983/solr/CompleteIndex/terms?terms=on&terms.fl=spell&terms.prefix=ccc&terms.mincount=0 

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="spell">
<int name="ccc">11</int>
</lst>
</lst>
</response>
- the analyzer for the spellchecker yields the terms as entered, i.e.
a|bb|ccc|dddd
- the config is as follows:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

<str name="queryAnalyzerFieldType">textSpell</str>

<lst name="spellchecker">
<str name="name">default</str>
<str name="field">spell</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
</searchComponent>


does anyone have a clue what's going on?


RE: possible spellcheck bug in 3.5 causing erroneous suggestions

Posted by "Dyer, James" <Ja...@ingrambook.com>.
It might be easier to know what's going on if you provide some snippets from solrconfig.xml and schema.xml.  But my guess is that in your solrconfig.xml, under the spellcheck "searchComponent" either the "queryAnalyzerFieldType" or the "fieldType" (one level down) is set to a field that is removing numbers or otherwise modifying the tokens on analysis.  The reason is that your query contained "ccc" but it says that "cccc1" is a misspelled word in your query.  Typically you want a simple analysis chain that just tokenizes on whitespace and little else for spellchecking.

With that said, I wouldn't be surprised if this was a bug as we've had problems in the past with words containing numbers, dashes and the like.  If you become convinced you've found a bug, would you be able to write a failing unit test and post it on JIRA?  See http://wiki.apache.org/solr/HowToContribute for more information.
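
As a rough illustration of the whitespace-only analysis described above, the following sketch splits the query on whitespace while tracking character offsets. It is plain Python, not Solr's actual analyzer, and whitespace_tokens is a hypothetical helper; the query string assumes the "++" in the URL decoded to two spaces.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only, keeping each token's (start, end) offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


tokens = whitespace_tokens("a bb ccc  dddd")
for token, start, end in tokens:
    print(token, start, end)
```

Under that assumption, the span 5-8 reported in the response belongs to "ccc", so a token such as "cccc1" would have to come from something other than whitespace tokenization.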

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: tom [mailto:dev.tom.menzel@gmx.net] 
Sent: Tuesday, March 27, 2012 2:31 AM
To: solr-user@lucene.apache.org
Subject: Re: possible spellcheck bug in 3.5 causing erroneous suggestions

so does anyone have a clue what is (or might be) going wrong?

or do i have to debug it myself and post a jira issue?

PS: unfortunately i can't give anyone the index for testing due to NDA.

cheers

On 22.03.2012 10:17, tom wrote:
> same
>
> On 22.03.2012 10:00, Markus Jelsma wrote:
>> Can you try spellcheck.q ?


Re: possible spellcheck bug in 3.5 causing erroneous suggestions

Posted by tom <de...@gmx.net>.
so does anyone have a clue what is (or might be) going wrong?

or do i have to debug it myself and post a jira issue?

PS: unfortunately i can't give anyone the index for testing due to NDA.

cheers

On 22.03.2012 10:17, tom wrote:
> same
>
> On 22.03.2012 10:00, Markus Jelsma wrote:
>> Can you try spellcheck.q ?


Re: possible spellcheck bug in 3.5 causing erroneous suggestions

Posted by tom <de...@gmx.net>.
same

On 22.03.2012 10:00, Markus Jelsma wrote:
> Can you try spellcheck.q ?


Re: possible spellcheck bug in 3.5 causing erroneous suggestions

Posted by Markus Jelsma <ma...@openindex.io>.
 Can you try spellcheck.q ?
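
For reference, spellcheck.q is the request parameter the SpellCheckComponent uses in place of q when it is present. The sketch below shows one way the original request could be rebuilt with it; the host and core come from the original post, build_spellcheck_url is a hypothetical helper, and passing the same text for both parameters is only an assumption about what such a test would look like.

```python
from urllib.parse import urlencode

# Host and core name as given in the original post.
BASE_URL = "http://lh:8983/solr/CompleteIndex/select"


def build_spellcheck_url(q, spellcheck_q=None):
    """Build the select URL; add spellcheck.q only when a value is given."""
    params = [
        ("rows", "0"),
        ("echoParams", "all"),
        ("spellcheck", "true"),
        ("spellcheck.onlyMorePopular", "true"),
        ("spellcheck.extendedResults", "no"),
        ("q", q),
    ]
    if spellcheck_q is not None:
        params.append(("spellcheck.q", spellcheck_q))
    # urlencode encodes spaces as "+", matching the style of the original URL.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_spellcheck_url("a bb ccc  dddd", spellcheck_q="a bb ccc  dddd"))
```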

