Posted to solr-user@lucene.apache.org by Sebastian M <mi...@yahoo.com> on 2010/12/22 16:32:18 UTC

Solr Spellchecker automatically tokenizes on period marks

Hello,


My main (full text) index contains the terms "www", "sometest", "com", which
is intended and correct.

My spellcheck index contains the term "www.sometest.com", which is also
intended and correct.

However, when querying the spellchecker using the query "www.sometest.com",
I get the suggestion "www.www.sometest.com.com", despite the fact that I'm
not using a tokenizer that splits on "." (period marks) as part of my
spellcheck query analyzer. 

When running the Field Analyzer (in the Solr admin page), I can see that
even after the last filter (see below), my term text remains
"www.sometest.com", which is untokenized, as expected. 

Any thoughts as to what may be causing this undesired tokenization?

To summarize:

Main index contains: "www", "sometest", "com"
Spellcheck index contains: "www.sometest.com"
Spellcheck query: "www.sometest.com"
Expected result: (no suggestion)
Actual result: "www.www.sometest.com.com"


Here is my spellcheck query analyzer:
<analyzer type="query">
	<tokenizer class="solr.WhitespaceTokenizerFactory"/>
	<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
	<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
	<filter class="solr.StandardFilterFactory"/>
	<filter class="solr.LowerCaseFilterFactory"/>
	<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>



Thank you in advance; any suggestions are welcome!
Sebastian
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Spellcheker-automatically-tokenizes-on-period-marks-tp2131844p2131844.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Spellchecker automatically tokenizes on period marks

Posted by Sebastian M <mi...@yahoo.com>.
I've noticed that the spellcheck component also seems to tokenize by itself
on question marks, not only period marks. 

Based on the spellcheck definition above, does anyone know how to stop Solr
from tokenizing strings in queries such as

www.sometest.com

(which causes suggestions of the form www.www.sometest.com.com)?

It gets really messy if the user then clicks the above suggestion, because
Solr then suggests www.www.www.sometest.com.com.com.

Thanks in advance!
Sebastian
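
One possible explanation for the symptom above, offered as a sketch rather than a confirmed diagnosis: if the query is split into runs of word characters before the field analyzer ever sees it, then "." and "?" both act as token boundaries, and a suggestion for the middle token gets spliced back into the original string. The regex, the function names and the suggestion table below are all assumptions made for illustration; none of this is Solr's actual code.

```python
import re

# Assumption: a simplified stand-in for a pre-analysis step that keeps
# only runs of word characters, so "." and "?" separate tokens.
WORD_RUN = re.compile(r"\w+")


def word_tokens(query):
    """Return (text, start, end) for each run of word characters."""
    return [(m.group(), m.start(), m.end()) for m in WORD_RUN.finditer(query)]


def collate(query, suggestions):
    """Splice each suggestion over its token, right to left so the
    offsets of earlier tokens stay valid."""
    result = query
    for text, start, end in reversed(word_tokens(query)):
        if text in suggestions:
            result = result[:start] + suggestions[text] + result[end:]
    return result


# Hypothetical suggestion table: only "sometest" has a suggestion.
suggestions = {"sometest": "www.sometest.com"}

first = collate("www.sometest.com", suggestions)
print(first)                         # www.www.sometest.com.com
print(collate(first, suggestions))   # www.www.www.sometest.com.com.com
```

Under these assumptions the sketch reproduces both reported strings, which would explain why the Field Analyzer page shows the term intact: the split would happen before the analyzer runs, not inside it.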

Re: Solr Spellchecker automatically tokenizes on period marks

Posted by Sebastian M <mi...@yahoo.com>.
Is it possible that the spellcheck query can be configured to stop tokenizing
on period marks through a parameter, rather than through the analyzer?
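
One parameter that may be worth trying here is spellcheck.q. My understanding, which should be verified against the Solr version in use, is that a value passed in spellcheck.q is run directly through the analyzer of queryAnalyzerFieldType, whereas the plain q value first goes through a query converter. The sketch below only builds such a request URL; the host, port and handler path are placeholders, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def spellcheck_url(base_url, user_query):
    """Build a request URL that passes the raw user query in spellcheck.q."""
    params = {
        "q": user_query,             # the normal search query
        "spellcheck": "true",        # enable the spellcheck component
        "spellcheck.q": user_query,  # the text to spellcheck
    }
    return base_url + "?" + urlencode(params)


# Assumption: a local Solr whose /select handler includes the spellcheck component.
print(spellcheck_url("http://localhost:8983/solr/select", "www.sometest.com"))
```

If the suggestion then comes back empty for "www.sometest.com", that would point to the pre-analysis splitting rather than the analyzer chain as the cause.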

Re: Solr Spellchecker automatically tokenizes on period marks

Posted by Sebastian M <mi...@yahoo.com>.
Hi and thanks for your reply,

My searchComponent is defined as follows:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">textSpell</str>

...
</searchComponent>


And then in my schema.xml, I have:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="query">
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
		<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
		<filter class="solr.StandardFilterFactory"/>
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
	</analyzer>

...
</fieldType>


This is the analyzer I pasted in my original post. So this only confirms
that the query term goes through this tokenizer and these filters, and none
of them splits on period marks.

Do you see any possible problems with my setup?

Thanks!
Sebastian

Re: Solr Spellchecker automatically tokenizes on period marks

Posted by Markus Jelsma <ma...@openindex.io>.
Check the analyzer of the field type you set as queryAnalyzerFieldType in
the search component.


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350