Posted to solr-user@lucene.apache.org by Maciej Dziardziel <fi...@gmail.com> on 2014/05/03 01:04:55 UTC

Spellchecking - looking for general advice

Hi

I was looking at the spellcheckers (Direct and FileBased) and testing what they can do.
Direct works fine most of the time, but I'd like to find a solution for
a few corner cases:

1) With "recruted" and "recruiter" in the index, "recruter" should
   suggest the latter.
    The distance to the former is no larger, so the choice may be
    completely arbitrary, and perhaps must be handled on the
    application side rather than in Solr.
2) "restraunt" doesn't suggest "restaurant" - I assume the distance
   is too big for that.
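A quick way to sanity-check both cases is to compute the edit distances directly. The sketch below is a plain dynamic-programming Levenshtein in Python; it is not Solr's implementation, the function name is only illustrative, and it may not reflect how the spellchecker actually scores or ranks candidates.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute; no transpositions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


for query, term in [("recruter", "recruted"),
                    ("recruter", "recruiter"),
                    ("restraunt", "restaurant")]:
    print(query, "->", term, levenshtein(query, term))
```

By this count "recruter" is one edit away from both "recruted" and "recruiter", so which of the two gets suggested presumably depends on something other than raw distance (an assumption, not checked against Solr). "restraunt" is three edits from "restaurant".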

Those are a few examples of queries that spellcheck gets wrong
(according to my requirements).
For now I am just looking at possible solutions. I need to come
up with an initial concept so that I have something to show to users
and can get more feedback, likely with more cases to correct.

I'd like to know if there are some tweaks to the spellcheck component I
could make (or perhaps other ways of doing this with Solr),
or am I forced to hardcode a list of all such corrections that go beyond
what spellcheck can do?

One solution I am considering is to put a list of those special cases
into FileBasedSpellChecker (it seems to be more relaxed, and handles
the restraunt case well) and fall back to Direct if this yields no
results... though I am not sure yet how well that would work in
practice if the list of misspelled words grew beyond the few I have
now. It most likely wouldn't scale.
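The "special cases first, general spellchecker second" idea can also be sketched on the application side. This is only an illustration in Python: the override table and vocabulary are hypothetical, and difflib's fuzzy matching stands in for whatever the Direct spellchecker would return.

```python
import difflib

# Hypothetical hand-maintained overrides (misspelling -> correction).
OVERRIDES = {"restraunt": "restaurant", "recruter": "recruiter"}

# Hypothetical stand-in for the terms in the index.
VOCABULARY = ["restaurant", "recruiter", "recruted", "manager"]


def suggest(word, overrides=OVERRIDES, vocabulary=VOCABULARY):
    """Return a correction for word, or None if there is nothing to suggest."""
    key = word.lower()
    # 1) Curated special cases always win.
    if key in overrides:
        return overrides[key]
    # 2) Known terms need no correction.
    if key in vocabulary:
        return None
    # 3) Fall back to generic fuzzy matching (stand-in for the Direct spellchecker).
    matches = difflib.get_close_matches(key, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None


print(suggest("restraunt"))  # answered from the override table
print(suggest("managr"))     # answered by the fuzzy fallback
```

The override table stays small only as long as the general fallback handles most misspellings, which is the scaling concern raised above.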

Another possibility would be to analyze the list of queries our users
run that yield few results and check whether there is a spellchecked
version that improves on that... but that seems to require a human to
review the corrections.
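A rough sketch of that log-analysis step, in Python. Everything here is hypothetical: hit_count and suggest are placeholders for a real search call and a real spellchecker, and min_hits is an arbitrary threshold. The output is a ranked list for a human to review; nothing is applied automatically.

```python
def propose_corrections(queries, hit_count, suggest, min_hits=5):
    """Collect (query, candidate, hits_before, hits_after) tuples worth reviewing.

    hit_count(q) -> int and suggest(q) -> str or None are supplied by the caller.
    """
    proposals = []
    for query in queries:
        before = hit_count(query)
        if before >= min_hits:
            continue  # the query already returns enough results
        candidate = suggest(query)
        if candidate is None or candidate == query:
            continue  # nothing to propose
        after = hit_count(candidate)
        if after > before:
            proposals.append((query, candidate, before, after))
    # Largest improvement first, so reviewers see the most valuable cases early.
    proposals.sort(key=lambda p: p[3] - p[2], reverse=True)
    return proposals
```

Approved pairs from such a review could then be added to the list of special cases.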

Yet another thing I was thinking about would be to pull the terms into
a separate spellchecker (like aspell) and see if it does a better job
or is more tweakable.

That's a bit of an open-ended problem, so any advice is welcome.

--
Maciej Dziardziel
fiedzia@gmail.com

RE: Spellchecking - looking for general advice

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Got it. Are you also considering stemming and phonetic matching here? For example, phonetic matching may catch some of the restaurant variations, stemming may reduce recruiter and recruited to their base word, and spellcheck would then remain as the final catch-all.
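To illustrate the phonetic idea, here is a small Python sketch of American Soundex. It is only a stand-in chosen for brevity: the phonetic encoder one would actually configure in Solr may group words differently, and the function name is illustrative.

```python
# Letter groups used by American Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of word ('' for empty input)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels map to '' and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


for w in ("restraunt", "restaurant", "recruter", "recruted", "recruiter"):
    print(w, soundex(w))
```

Here "restraunt" and "restaurant" get the same code, so a phonetic field could match them even though their edit distance is 3. "recruter", "recruted" and "recruiter" also collapse to a single code, so Soundex alone would not choose between them.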

This e-mail message may contain confidential or legally privileged information and is intended only for the use of the intended recipient(s). Any unauthorized disclosure, dissemination, distribution, copying or the taking of any action in reliance on the information herein is prohibited. E-mails are not secure and cannot be guaranteed to be error free as they can be intercepted, amended, or contain viruses. Anyone who communicates with us by e-mail is deemed to have accepted these risks. The Digital Group is not responsible for errors or omissions in this message and denies any responsibility for any damage arising from the use of e-mail. Any opinion defamatory or deemed to be defamatory or  any material which could be reasonably branded to be a species of plagiarism and other statements contained in this message and any attachment are solely those of the author and do not necessarily represent those of the company.

Re: Spellchecking - looking for general advice

Posted by Maciej Dziardziel <fi...@gmail.com>.
Hi

I've set it to 2, but a Python implementation of Levenshtein says it's 3
for restraunt -> restaurant.

On Sat, May 3, 2014 at 2:44 PM, Susheel Kumar
<su...@thedigitalgroup.net> wrote:
> What maxEdits value have you set? It should catch the restaurant example with the edit distance set to 2.

-- 
Maciej Dziardziel
fiedzia@gmail.com

RE: Spellchecking - looking for general advice

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
What maxEdits value have you set? It should catch the restaurant example with the edit distance set to 2.

Thanks,
Susheel
