Posted to solr-user@lucene.apache.org by Ed Smiley <es...@ebrary.com> on 2014/04/15 23:04:52 UTC
Odd extra character duplicates in spell checking

Hi,
I am going to keep this question pretty short, so I don’t overwhelm you with technical details until the end.
I suspect that some folks may be seeing this issue without the particular configuration we are using.

What our problem is:

  1.  Correctly spelled words are reported as misspelled, and the suggestions returned are the original, correctly spelled word with a single oddball character appended (several such suggestions per word).
  2.  Incorrectly spelled words return the correct spelling, but again with a single oddball character appended, as several suggestions.
  3.  We’re seeing this in Solr 4.5.x and 4.7.x.

Example:

Each returned correction is the expected word followed by a single spurious character (its Unicode code point is shown in square brackets).

correction=attitude[2d]
correction=attitude[2f]
correction=attitude[2026]

Spurious characters:

  *   Unicode Character 'HYPHEN-MINUS' (U+002D)
  *   Unicode Character 'SOLIDUS' (U+002F)
  *   Unicode Character 'HORIZONTAL ELLIPSIS' (U+2026)
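
For reference, a minimal Java sketch of how the bracketed notation in the example can be produced from a raw suggestion string. The class and method names are hypothetical and are not taken from the application described in this post; the sketch only assumes that the last code point of each suggestion is the suspicious one.

```java
final class TrailingCodePoint {
    private TrailingCodePoint() {}

    /** Replaces the last code point of a suggestion with its hex value in brackets. */
    static String describe(String suggestion) {
        if (suggestion.isEmpty()) {
            return suggestion;
        }
        // codePointBefore handles characters outside the BMP correctly.
        int last = suggestion.codePointBefore(suggestion.length());
        int start = suggestion.length() - Character.charCount(last);
        return suggestion.substring(0, start) + "[" + Integer.toHexString(last) + "]";
    }

    public static void main(String[] args) {
        String[] samples = {"attitude-", "attitude/", "attitude\u2026"};
        for (String s : samples) {
            System.out.println("correction=" + describe(s));
        }
    }
}
```

Run against the three corrections from the example, this prints the same `correction=attitude[2d]`, `correction=attitude[2f]`, and `correction=attitude[2026]` lines.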

Anybody see anything like this?  Anybody fix something like this?

Thanks!
—Ed

========================================================================
OK, here are the gory details:


What we are doing:
We have developed an application that returns “did you mean” spelling alternatives for a specific (presumably misspelled) word.
We’re using the vocabulary of the indexed pages of a specified book as the source of the alternatives, so this is not a general dictionary spell check; we are returning only matching alternatives.
So when I say “correctly spelled” I mean words that are found on at least one page.  We are using the collations so that we restrict ourselves to the pages in one book.
We are having to check for and “fix up” these faulty results.  That’s not a robust or desirable solution.
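
The post does not show the actual fix-up code, so the following is only a guess at what such a workaround might look like: a minimal Java sketch that drops one trailing non-alphanumeric character from each suggestion and removes the duplicates that result. The class and method names are hypothetical, and the "letter or digit" rule is an assumption, not the rule the application uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SuggestionFixUp {
    private SuggestionFixUp() {}

    /** Drops the last code point if it is not a letter or digit (assumed definition of "oddball"). */
    static String stripTrailingOddball(String word) {
        if (word.isEmpty()) {
            return word;
        }
        int last = word.codePointBefore(word.length());
        if (Character.isLetterOrDigit(last)) {
            return word;
        }
        return word.substring(0, word.length() - Character.charCount(last));
    }

    /** Cleans every suggestion and removes duplicates while keeping the original order. */
    static List<String> fixUp(List<String> suggestions) {
        Set<String> unique = new LinkedHashSet<>();
        for (String s : suggestions) {
            String cleaned = stripTrailingOddball(s);
            if (!cleaned.isEmpty()) {
                unique.add(cleaned);
            }
        }
        return new ArrayList<>(unique);
    }
}
```

With the three corrections from the example, `fixUp` collapses `attitude-`, `attitude/`, and `attitude…` into the single entry `attitude`. It also shows why this kind of post-processing is fragile: a term that legitimately ends in punctuation would be altered too.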

We are using SolrJ to get the collations:
    private static final String DID_YOU_MEAN_REQUEST_HANDLER = "/spell";
    ….
    SolrQuery query = new SolrQuery(q);
    query.set("spellcheck", true);
    query.set(SpellingParams.SPELLCHECK_COUNT, 10);
    query.set(SpellingParams.SPELLCHECK_COLLATE, true);
    query.set(SpellingParams.SPELLCHECK_COLLATE_EXTENDED_RESULTS, true);
    query.set("wt", "json");
    query.setRequestHandler(DID_YOU_MEAN_REQUEST_HANDLER);
    query.set("shards.qt", DID_YOU_MEAN_REQUEST_HANDLER);
    query.set("shards.tolerant", "true");
    etc……

But we can reproduce the behavior without SolrJ; see the collation / misspellingsAndCorrections entries in the response below, e.g.:
solr/pg1/spell?q=+doc-id:(810500)+AND+attitudex&spellcheck=true&spellcheck.count=10&spellcheck.collate=true&spellcheck.collateExtendedResults=true&wt=json&qt=%2Fspell&shards.qt=%2Fspell&shards.tolerant=true
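
For anyone who wants to reproduce the request above from plain Java without SolrJ, here is a minimal sketch that assembles the same query string with the standard library only. The class and method names are hypothetical; the relative path and parameter values are copied from the request above, and the host and port are deliberately left out because the post does not give them.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class SpellRequestUrl {
    private SpellRequestUrl() {}

    /** URL-encodes one key or value as UTF-8 (spaces become '+'). */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds the relative spell-check URL for the given handler path and query. */
    static String build(String path, String q) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", q);
        params.put("spellcheck", "true");
        params.put("spellcheck.count", "10");
        params.put("spellcheck.collate", "true");
        params.put("spellcheck.collateExtendedResults", "true");
        params.put("wt", "json");
        params.put("qt", "/spell");
        params.put("shards.qt", "/spell");
        params.put("shards.tolerant", "true");

        StringBuilder url = new StringBuilder(path).append('?');
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("solr/pg1/spell", " doc-id:(810500) AND attitudex"));
    }
}
```

The output differs from the hand-written request only in that `URLEncoder` also percent-encodes the colon and parentheses in `q`; the two forms decode to the same parameters.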


{
  "responseHeader": {"status": 0, "QTime": 60},
  "response": {"numFound": 0, "start": 0, "maxScore": 0.0, "docs": []},
  "spellcheck": {
    "suggestions": [
      "attitudex", {
        "numFound": 6,
        "startOffset": 21,
        "endOffset": 30,
        "origFreq": 0,
        "suggestion": [
          {"word": "attitudes", "freq": 362486},
          {"word": "attitu dex", "freq": 4819},
          {"word": "atti tudex", "freq": 3254},
          {"word": "attit udex", "freq": 159},
          {"word": "attitude-", "freq": 1080},
          {"word": "attituden", "freq": 261}
        ]
      },
      "correctlySpelled", false,
      "collation", [
        "collationQuery", " doc-id:(810500) AND attitude-",
        "hits", 2,
        "misspellingsAndCorrections", ["attitudex", "attitude-"]
      ],
      "collation", [
        "collationQuery", " doc-id:(810500) AND attitude/",
        "hits", 2,
        "misspellingsAndCorrections", ["attitudex", "attitude/"]
      ],
      "collation", [
        "collationQuery", " doc-id:(810500) AND attitude…",
        "hits", 2,
        "misspellingsAndCorrections", ["attitudex", "attitude…"]
      ]
    ]
  }
}

The configuration is:

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">text</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>


<lst name="spellchecker">
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  <str name="field">text</str>
  <str name="combineWords">true</str>
  <str name="breakWords">true</str>
  <int name="maxChanges">25</int>
  <int name="minBreakLength">3</int>
</lst>


<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="distanceMeasure">internal</str>
  <float name="accuracy">0.2</float>
  <int name="maxEdits">2</int>
  <int name="minPrefix">1</int>
  <int name="maxInspections">25</int>
  <int name="minQueryLength">4</int>
  <float name="maxQueryFrequency">1</float>
</lst>

--

Ed Smiley, Senior Software Architect, eBooks
ProQuest | 161 E Evelyn Ave|
Mountain View, CA 94041 | USA |
+1 650 475 8700 extension 3772
ed.smiley@proquest.com
www.proquest.com | www.ebrary.com | www.eblib.com
ebrary and EBL, ProQuest businesses.