Posted to solr-user@lucene.apache.org by Martin Grotzke <ma...@javakaffee.de> on 2008/10/06 09:51:45 UTC

Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

Hi Jason,

what about multi-word searches like "harry potter"? When I do a search
in our index for "harry poter", I get the suggestion "harry
spotter" (using spellcheck.collate=true and jarowinkler distance).
Searching for "harry spotter" (we're searching AND, not OR) then gives
no results. I assume this is because suggestions are made for each word
separately, which does not require that both/all suggested words are
contained in the same document.

I wonder what's the standard approach for searches with multiple words.
Are these working ok for you?

Cheers,
Martin

On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
> Hi Martin,
> 
> I'm a relative newbie to solr, have been playing with the spellcheck
> component and seem to have it working.  I certainly can't explain what all
> is going on, but with any luck, I can help you get the spellchecker
> up-and-running.  Additional replies in-lined below.
> 
> On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke <martin.grotzke@javakaffee.de> wrote:
> 
> > Now I'm thinking about the source-field in the spellchecker ("spell"):
> > how should fields be analyzed during indexing, and how should the
> > queryAnalyzerFieldType be configured.
> 
> 
> I followed the conventions in the default solrconfig.xml and schema.xml
> files.  So I created a "textSpell" field type (schema.xml):
> 
>     <!-- field type for the spell checker which doesn't stem -->
>     <fieldtype name="textSpell" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> 
> and used this for the queryAnalyzerFieldType.  I also created a spellField
> to store the text I want to spell check against and used the same analyzer
> (figuring that the query and indexed data should be analyzed the same way)
> (schema.xml):
> 
>    <!-- Spell check field -->
>    <field name="spellField" type="textSpell" indexed="true" stored="true" />
> 
> 
> 
> > If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them (the
> > field "brand") directly to the "spell" field. The "spell" field is of
> > type "string".
> 
> 
> We're copying description to spellField.  I'd recommend using a type like
> the above textSpell type since "The StringField type is not analyzed, but
> indexed/stored verbatim" (schema.xml):
> 
>   <copyField source="description" dest="spellField" />
> 
> > Other fields like e.g. the product title I would first copy to some
> > whitespaceTokenized field (field type with WhitespaceTokenizerFactory)
> > and afterwards to the "spell" field. The product title might be e.g.
> > "Canon EOS 450D EF-S 18-55 mm".
> 
> 
> Hmm... I'm not sure if this would work as I don't think the analyzer is
> applied until after the copy is made.  FWIW, I've had trouble copying
> multiple fields to spellField (i.e. adding a second copyField w/
> dest="spellField"), so we just index the spellchecker on a single field...
> 
> > Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
> > StandardTokenizerFactory here?
> 
> 
> I think if you use the same analyzer for indexing and queries, the
> distinction probably isn't tremendously important.  When I went searching,
> it looked like the StandardTokenizer split on non-letters.  I'd guess the
> rationale for using the StandardTokenizer is that it won't recommend
> non-letter characters.  I was seeing some weirdness earlier (no
> inserts/deletes), but that disappeared now that I'm using the
> StandardTokenizer.
> 
> Cheers,
> 
> Jason
-- 
Martin Grotzke
http://www.javakaffee.de/blog/

Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

Posted by Martin Grotzke <ma...@javakaffee.de>.
Thanx for your help so far, I just wanted to post my results here...

In short: Now I use the ShingleFilter to create shingles when copying my
fields into my field "spellMultiWords". For query time, I implemented a
MultiWordSpellingQueryConverter that leaves the query as is, so
that there's only one token that is checked for spelling suggestions.

Here's the detailed configuration:

= schema.xml =
    <fieldType name="textSpellMultiWords" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

   <field name="spellMultiWords" type="textSpellMultiWords" indexed="true" stored="true" multiValued="true"/>

   <copyField source="name" dest="spellMultiWords" />
   <copyField source="cat" dest="spellMultiWords" />
   ... and more ...
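To illustrate what ends up in the spell index: with maxShingleSize="3" and outputUnigrams="true", the analyzer should emit unigrams, bigrams and trigrams for each field value. Here's a rough Python simulation of the tokenize/lowercase/shingle chain (not the actual Lucene filter, which also tracks positions and offsets):

```python
def shingles(text, max_size=3, output_unigrams=True):
    """Simulate WhitespaceTokenizer + LowerCaseFilter + ShingleFilter."""
    tokens = text.lower().split()
    min_size = 1 if output_unigrams else 2
    result = []
    for n in range(min_size, max_size + 1):   # shingle sizes 1..max_size
        for i in range(len(tokens) - n + 1):  # sliding window over tokens
            result.append(" ".join(tokens[i:i + n]))
    return result

print(shingles("Canon EOS 450D"))
# ['canon', 'eos', '450d', 'canon eos', 'eos 450d', 'canon eos 450d']
```

So a multi-word query can be matched against an indexed shingle like "canon eos" as a single term.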


= solrconfig.xml =
  
  <searchComponent name="spellcheckMultiWords" class="solr.SpellCheckComponent">

    <!-- this is not used at all, can probably be omitted -->
    <str name="queryAnalyzerFieldType">textSpellMultiWords</str>

    <lst name="spellchecker">
      <!-- Optional; required when more than one spellchecker is configured -->
      <str name="name">default</str>
      <str name="field">spellMultiWords</str>
      <str name="spellcheckIndexDir">./spellcheckerMultiWords1</str>
      <str name="accuracy">0.5</str>
      <str name="buildOnCommit">true</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">jarowinkler</str>
      <str name="field">spellMultiWords</str>
      <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
      <str name="spellcheckIndexDir">./spellcheckerMultiWords2</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>
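For reference, a request handler wiring this component in might look roughly like the following (a sketch along the lines of the SpellCheckComponent wiki page; the handler name and defaults here are illustrative, not taken from my actual config):

```xml
  <requestHandler name="/spellCheckCompRH" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <!-- pick the "jarowinkler" spellchecker defined above -->
      <str name="spellcheck.dictionary">jarowinkler</str>
      <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="last-components">
      <str>spellcheckMultiWords</str>
    </arr>
  </requestHandler>
```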
  
  <queryConverter name="queryConverter" class="my.proj.solr.MultiWordSpellingQueryConverter"/>


= MultiWordSpellingQueryConverter =

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

public class MultiWordSpellingQueryConverter extends QueryConverter {

    /**
     * Converts the original query string into a single Lucene Token, so
     * that the whole query is checked for suggestions as one term.
     *
     * @param original the original query string
     * @return a Collection containing a single Lucene Token
     */
    public Collection<Token> convert( String original ) {
        if ( original == null ) {
            return Collections.emptyList();
        }
        final Token token = new Token(0, original.length());
        token.setTermBuffer( original );
        return Arrays.asList( token );
    }
    
}



There are some issues still to be resolved:
- terms are lowercased in the index, so suggestions would need some kind
  of case restoration
- we use stemming for our text field, so the spellchecker might suggest
  searches that lead to identical results (e.g. the german2 stemmer
  stems both "hose" and "hosen" to "hos" -> "Hose" and "Hosen" give the
  same results)
- inconsistent/strange sorting of suggestions (as described in
  http://www.nabble.com/spellcheck%3A-issues-td19845539.html).
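On the case-restoration point: nothing in Solr does this for us as far as I can see, but one hypothetical approach is to record, for each lowercased term, the original casings seen at index time, and map suggestions back through that table. A rough Python sketch (the CaseRestorer name and API are made up for illustration):

```python
from collections import Counter, defaultdict

class CaseRestorer:
    """Map lowercased suggestions back to the most frequent original casing."""

    def __init__(self):
        self.seen = defaultdict(Counter)

    def observe(self, original_token):
        # called once per token while feeding documents to the index
        self.seen[original_token.lower()][original_token] += 1

    def restore(self, lowercased):
        forms = self.seen.get(lowercased)
        if not forms:
            return lowercased  # unknown term: leave it as-is
        return forms.most_common(1)[0][0]

r = CaseRestorer()
for t in ["Hose", "Hose", "hose", "Harry", "Potter"]:
    r.observe(t)
print(r.restore("hose"))  # 'Hose' (seen twice vs. 'hose' once)
```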


Cheers,
Martin


-- 
Martin Grotzke
http://www.javakaffee.de/blog/

Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

Posted by Martin Grotzke <ma...@javakaffee.de>.
On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote: 
> On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:
> 
> > Hi Jason,
> >
> > what about multi-word searches like "harry potter"? When I do a search
> > in our index for "harry poter", I get the suggestion "harry
> > spotter" (using spellcheck.collate=true and jarowinkler distance).
> > Searching for "harry spotter" (we're searching AND, not OR) then gives
> > no results. I assume this is because suggestions are made for each
> > word separately, which does not require that both/all suggested
> > words are contained in the same document.
> >
> 
> Yeah, the SpellCheckComponent is not phrase aware.  My guess would be
> that you would somehow need a QueryConverter (see
> http://wiki.apache.org/solr/SpellCheckComponent) that preserved phrases
> as a single token.  Likewise, you would need that on your indexing side
> as well for the spell checker.  In short, I suppose it's possible, but
> it would be work.  You probably could use the shingle filter (token
> based n-grams).
I also thought about something like this, and also stumbled over the
ShingleFilter :)

So I would change the "spell" field to use the ShingleFilter?

Did I understand the answer to the "chaining copyFields" posting
correctly: that I cannot pipe the title through some "shingledTitle"
field and then copy it to the "spell" field (while other fields like
brand are copied directly to the spell field)?

Thanx && cheers,
Martin


> 
> Alternatively, by using extendedResults, you can get back the
> frequency of each of the words, and then you could decide whether the
> collation is going to have any results, assuming they are all OR'd
> together.  For phrases and AND queries, I'm not sure.  It's doable,
> I'm sure, but it would be a lot more involved.

Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:

> Hi Jason,
>
> what about multi-word searches like "harry potter"? When I do a search
> in our index for "harry poter", I get the suggestion "harry
> spotter" (using spellcheck.collate=true and jarowinkler distance).
> Searching for "harry spotter" (we're searching AND, not OR) then gives
> no results. I assume this is because suggestions are made for each
> word separately, which does not require that both/all suggested words
> are contained in the same document.
>

Yeah, the SpellCheckComponent is not phrase aware.  My guess would be
that you would somehow need a QueryConverter (see
http://wiki.apache.org/solr/SpellCheckComponent) that preserved phrases
as a single token.  Likewise, you would need that on your indexing side
as well for the spell checker.  In short, I suppose it's possible, but
it would be work.  You probably could use the shingle filter (token
based n-grams).

Alternatively, by using extendedResults, you can get back the
frequency of each of the words, and then you could decide whether the
collation is going to have any results, assuming they are all OR'd
together.  For phrases and AND queries, I'm not sure.  It's doable,
I'm sure, but it would be a lot more involved.
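The extendedResults idea can be sketched as follows (a toy illustration, not Solr code): given the index frequency of each suggested word, a collation can only match an AND query if every word occurs at least once in the index. That's a necessary check, though it still says nothing about the words co-occurring in one document.

```python
def collation_can_match(word_freqs, op="AND"):
    """word_freqs: {suggested_word: index frequency}, e.g. from extendedResults."""
    if op == "AND":
        # necessary (not sufficient) condition: every word must exist somewhere
        return all(freq > 0 for freq in word_freqs.values())
    # OR semantics: one known word is enough
    return any(freq > 0 for freq in word_freqs.values())

print(collation_can_match({"harry": 120, "potter": 80}))   # True
print(collation_can_match({"harry": 120, "pottter": 0}))   # False
```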



--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

Posted by Walter Underwood <wu...@netflix.com>.
This is why OR is a better choice. With AND, one miss means no results
at all. Spelling suggestions will never be good enough to make AND work.

wunder

On 10/6/08 12:51 AM, "Martin Grotzke" <ma...@javakaffee.de> wrote:

> Hi Jason,
> 
> what about multi-word searches like "harry potter"? When I do a search
> in our index for "harry poter", I get the suggestion "harry
> spotter" (using spellcheck.collate=true and jarowinkler distance).
> Searching for "harry spotter" (we're searching AND, not OR) then gives
> no results. I assume this is because suggestions are made for each
> word separately, which does not require that both/all suggested words
> are contained in the same document.
> 
> I wonder what's the standard approach for searches with multiple words.
> Are these working ok for you?
> 
> Cheers,
> Martin