Posted to solr-user@lucene.apache.org by Camden Daily <ca...@jaunter.com> on 2011/01/17 20:00:48 UTC

Spell Checking a multi word phrase

Hello all,

I'm pretty new to Solr, and trying to set up a spell checker that can handle
entire phrases.  My goal would be to have something that could offer a
suggestion of "united states" for a query of "untied stats".

I have a very large index, and I've worked a bit with creating shingles for
the spelling index.  The problem I'm running into now is that the
SpellCheckComponent is always tokenizing the query that I pass to it.

For example, a query like this
http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&debugQuery=on

The debug information shows me that the parsed query is:
PhraseQuery(text:"untied stats")

But I receive the spelling suggestions for "untied" and "stats" separately.
From what I understand, this is not a case where I would want to collate; I
simply want the entire phrase treated as one token.

I found the following post after much searching that suggests setting up a
custom QueryConverter:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200810.mbox/%3C1224516331.3820.119.camel@localhost.localdomain.tld%3E

Does anyone know if that would be required?  I had hoped to avoid Java code
entirely with Solr (I haven't used Java in a very long time), but if I do
need to set up the 'MultiWordSpellingQueryConvert' class, would anyone be
able to give me some tips of exactly how I would add that functionality to
Solr?

Relevant configs below:

solrconfig.xml:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spellShingle</str>
      <str name="spellcheckIndexDir">./spellShingle</str>
      <str name="queryAnalyzerFieldType">textSpellShingle</str>
      <str name="buildOnOptimize">true</str>
    </lst>
  </searchComponent>

schema.xml:

    <fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

(I had thought setting the KeywordTokenizer for the query analyzer would
keep it from being tokenized, but it doesn't seem to make any difference)

-Camden Daily
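
As a rough illustration of what the index analyzer in the schema above puts into the spelling index, the following Python sketch mimics a shingle filter with maxShingleSize=2 and outputUnigrams=true. It is not Lucene's ShingleFilter: the function name `shingles` and its parameters are hypothetical, and stop-word removal is ignored.

```python
def shingles(tokens, max_size=2, output_unigrams=True):
    """Return word n-grams (shingles) up to max_size words long.

    Toy stand-in for a shingle filter; an assumption for illustration,
    not the Lucene implementation.
    """
    out = []
    smallest = 1 if output_unigrams else 2
    for i in range(len(tokens)):
        for n in range(smallest, max_size + 1):
            # Only emit a shingle if it fits inside the token list.
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


# Lowercase and split on whitespace, roughly what the tokenizer and
# lowercase filter do for simple input.
print(shingles("United States".lower().split()))
```

With both "united" and "united states" in the spelling index, a phrase-level suggestion is only possible if the query side also reaches the spell checker as the single token "untied stats".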

Re: Spell Checking a multi word phrase

Posted by Camden Daily <ca...@jaunter.com>.

James,

Thanks, the spellcheck.q was exactly what I needed to be using!

-Camden
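
A minimal sketch of the kind of request this refers to, assuming the host and /solr/spell handler from the original post. The helper name `spell_url` is hypothetical. The usual explanation is that a value passed in spellcheck.q is run through the query analyzer of the configured queryAnalyzerFieldType (here the KeywordTokenizer), whereas q on its own is first split into words by the QueryConverter; treat that as an assumption to verify against your Solr version.

```python
from urllib.parse import urlencode


def spell_url(phrase, base="http://localhost:8080/solr/spell"):
    """Build a request that sends the raw phrase in spellcheck.q.

    The base URL is taken from the original post; adjust it to your setup.
    """
    params = {
        "q": f'"{phrase}"',       # the main query, as a quoted phrase
        "spellcheck": "true",
        "spellcheck.q": phrase,   # raw phrase handed to the spell checker
    }
    return base + "?" + urlencode(params)


print(spell_url("untied stats"))
```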

On Mon, Jan 17, 2011 at 3:54 PM, Dyer, James <Ja...@ingrambook.com> wrote:

> Camden,
>
> Have you seen Smiley & Pugh's Solr book?  They describe something very
> similar to what you're trying to do on p. 180 ff.  The difference seems to be
> they use a field that only has a couple of terms so they don't bother with
> shingles.  The book makes a big point about using "spellcheck.q" in this
> case in order to get the analysis right.  I'm not sure if this is the
> solution but I thought I'd mention it.  I never tried spell checking this
> way because it seemed very limited and possibly quite expensive.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311

RE: Spell Checking a multi word phrase

Posted by "Dyer, James" <Ja...@ingrambook.com>.

Camden,

Have you seen Smiley & Pugh's Solr book?  They describe something very similar to what you're trying to do on p. 180 ff.  The difference seems to be they use a field that only has a couple of terms so they don't bother with shingles.  The book makes a big point about using "spellcheck.q" in this case in order to get the analysis right.  I'm not sure if this is the solution but I thought I'd mention it.  I never tried spell checking this way because it seemed very limited and possibly quite expensive.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Camden Daily [mailto:camden@jaunter.com] 
Sent: Monday, January 17, 2011 1:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Spell Checking a multi word phrase

James,

Thank you, but I'm not sure that will work for my needs.  I'm very
interested in contextual spell checking.  Take for example the author
"stephenie meyer".  "stephenie" is a far less popular spelling than
"stephanie", but in this context it's the correct option.  I feel like
shingles with an untokenized query string would be able to catch this, but
I can't find too many examples of people attempting this.


Re: Spell Checking a multi word phrase

Posted by Camden Daily <ca...@jaunter.com>.

James,

Thank you, but I'm not sure that will work for my needs.  I'm very
interested in contextual spell checking.  Take for example the author
"stephenie meyer".  "stephenie" is a far less popular spelling than
"stephanie", but in this context it's the correct option.  I feel like
shingles with an untokenized query string would be able to catch this, but
I can't find too many examples of people attempting this.
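
To make the "stephenie meyer" case concrete, here is a small Python sketch under loud assumptions: the spelling index is a made-up list of unigrams and one shingle, and difflib's similarity ratio stands in for whatever distance measure the real spell checker uses. The name `suggest` is hypothetical.

```python
from difflib import get_close_matches

# Made-up spelling index containing single words and one two-word shingle.
SPELLING_INDEX = ["stephanie", "stephenie", "meyer", "stephenie meyer"]


def suggest(query, index=SPELLING_INDEX, n=1):
    """Return the n closest index entries to the query, treated as ONE token."""
    return get_close_matches(query.lower(), index, n=n, cutoff=0.6)


# Checked word by word, "stephanie" is already a valid entry, so nothing
# would be corrected.
print(suggest("stephanie"))

# Checked as a whole phrase, the closest entry is the shingle with the
# less common spelling.
print(suggest("stephanie meyer"))
```

The point of the sketch is only that context comes from matching the whole phrase against shingles; the scoring is not meant to mirror Solr's.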

On Mon, Jan 17, 2011 at 2:19 PM, Dyer, James <Ja...@ingrambook.com> wrote:

> Camden,
>
> You may also want to be aware that there is a new feature added to Spell
> Check's "collate" functionality that will guarantee the collations will
> return hits.  It also is able to return more than one collation and tell you
> how many hits each one would result in if re-queried.  This might do the
> same thing you're trying to do using shingles, but with more accuracy and
> less work.
>
> For info, look at "spellcheck.collate", "spellcheck.maxCollations",
> "spellcheck.maxCollationTries" & "spellcheck.collateExtendedResults" on the
> component's wiki page:
> http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
>
> This feature is committed to 3.x and 4.x and is available as a patch for
> 1.4.1 (here:  https://issues.apache.org/jira/browse/SOLR-2010).
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311

RE: Spell Checking a multi word phrase

Posted by "Dyer, James" <Ja...@ingrambook.com>.

Camden,

You may also want to be aware that there is a new feature added to Spell Check's "collate" functionality that will guarantee the collations will return hits.  It also is able to return more than one collation and tell you how many hits each one would result in if re-queried.  This might do the same thing you're trying to do using shingles, but with more accuracy and less work.

For info, look at "spellcheck.collate", "spellcheck.maxCollations", "spellcheck.maxCollationTries" & "spellcheck.collateExtendedResults" on the component's wiki page: http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate

This feature is committed to 3.x and 4.x and is available as a patch for 1.4.1 (here:  https://issues.apache.org/jira/browse/SOLR-2010).

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
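
A hedged sketch of a request that turns on the collation parameters named above. The parameter names come from the message; the values (3 collations, 10 tries), the helper name `collate_url`, and the host and handler (borrowed from the original post) are assumptions for illustration only.

```python
from urllib.parse import urlencode


def collate_url(query, base="http://localhost:8080/solr/spell"):
    """Build a spellcheck request with collation enabled.

    The numeric values below are placeholders, not recommendations.
    """
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.collate": "true",
        "spellcheck.maxCollations": 3,        # assumed value
        "spellcheck.maxCollationTries": 10,   # assumed value
        "spellcheck.collateExtendedResults": "true",
    }
    return base + "?" + urlencode(params)


print(collate_url("untied stats"))
```

Whether these parameters are honoured depends on the Solr version, as the message notes (3.x and 4.x, or 1.4.1 with the SOLR-2010 patch).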

