Posted to dev@lucene.apache.org by "James Dyer (JIRA)" <ji...@apache.org> on 2010/09/08 18:51:34 UTC

[jira] Issue Comment Edited: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

    [ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907304#action_12907304 ] 

James Dyer edited comment on SOLR-2010 at 9/8/10 12:50 PM:
-----------------------------------------------------------

Two new versions of the patch:

1. SOLR-2010_shardSearchHandler_993538.patch is the same as the 8/23/2010 version, except that it applies cleanly to trunk revision #993538.  In a distributed setup, this version calls an overloaded method on SearchHandler and reuses its logic for combining results from the collation test queries.  The code is simpler, but it requires many more round-trips between shards.  It also guarantees that a distributed setup always returns exactly the same collations, in the same order, as a non-distributed setup.

2. SOLR-2010_shardRecombineCollations_993538.patch is similar to the 8/19/2010 version, with improvements.  This version also applies cleanly to trunk revision #993538.  In a distributed setup, each shard calls QueryComponent individually and generates its own list of collations.  The SpellCheckComponent then combines and sorts the resulting collations and returns the best ones, up to the client-specified maximum.  This requires more complicated logic in SpellCheckComponent.finishStage(), but it does not require changes to SearchHandler or ResponseBuilder.  There may be cases where a distributed setup returns different collations, or the same collations in a different order, than a non-distributed setup.  I do not believe this potential disparity would ever be very significant.
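
To make the recombination step in version 2 concrete, here is a minimal Python sketch.  It is not code from the patch.  The function name merge_shard_collations and the (query, hits) tuple shape are hypothetical, and it assumes that hit counts for the same collation are summed across shards and that "best" means most total hits; the patch may rank collations differently.

```python
def merge_shard_collations(shard_responses, max_collations):
    """Combine per-shard collation lists into one ranked list.

    shard_responses: one list per shard of (collation_query, hits) tuples.
    Assumption: hits for the same collation are summed across shards, and
    collations with more total hits rank higher.
    """
    totals = {}
    for collations in shard_responses:
        for query, hits in collations:
            totals[query] = totals.get(query, 0) + hits
    # Sort by total hits (descending); break ties on the query string so
    # the output order is deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_collations]


# Two shards that each verified a different set of collations.
shard_a = [("Title:(how AND fails)", 1), ("Title:(hope AND faith)", 2)]
shard_b = [("Title:(how AND fails)", 1), ("Title:(chops AND all)", 1)]
best = merge_shard_collations([shard_a, shard_b], max_collations=2)
```

Because each shard only reports the collations it verified against its own slice of the index, a merge like this can end up with a different set or order than a single-index run would produce, which is the disparity mentioned above.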

Grant, I believe version 1 is something like what you were thinking of on 8/9 and 8/19.  Version 2 is more like what you describe in your comment from 8/30.  Let me know if you think this needs any more tweaking.  Also, if you're thinking of committing this someday, you may want to look at SOLR-2083 as well.  As I understand it, the distributed SpellCheckComponent as it currently exists in trunk is broken.  If I'm right, we may want to fix it before adding more functionality.

> Improvements to SpellCheckComponent Collate functionality
> ---------------------------------------------------------
>
>                 Key: SOLR-2010
>                 URL: https://issues.apache.org/jira/browse/SOLR-2010
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java, spellchecker
>    Affects Versions: 1.4.1
>         Environment: Tested against trunk revision 966633
>            Reporter: James Dyer
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.txt, SOLR-2010_shardRecombineCollations_993538.patch, SOLR-2010_shardSearchHandler_993538.patch
>
>
> Improvements to SpellCheckComponent Collate functionality
> Our project requires a better Spell Check Collator.  I'm contributing this as a patch to get suggestions for improvements and in case there is a broader need for these features.
> 1. Only return collations that are guaranteed to result in hits if re-queried (applying original fq params also).  This is especially helpful when there is more than one correction per query.  The 1.4 behavior does not verify that a particular combination will actually return hits.
> 2. Provide the option to get multiple collation suggestions.
> 3. Provide extended collation results including the # of hits re-querying will return and a breakdown of each misspelled word and its correction.
> This patch is similar to what is described in SOLR-507 item #1.  Also, this patch provides a viable workaround for the problem discussed in SOLR-1074.  A dictionary could be created that combines the terms from the multiple fields.  The collator then would prune out any spurious suggestions this would cause.
> This patch adds the following spellcheck parameters:
> 1. spellcheck.maxCollationTries - maximum # of collation possibilities to try before giving up.  Lower values ensure better performance.  Higher values may be necessary to find a collation that can return results.  Default is 0, which maintains backwards-compatible behavior (do not check collations).
> 2. spellcheck.maxCollations - maximum # of collations to return.  Default is 1, which maintains backwards-compatible behavior.
> 3. spellcheck.collateExtendedResult - if true, returns an expanded response format detailing the collations found.  Default is false, which maintains backwards-compatible behavior.  When true, output is like this (in context):
> <lst name="spellcheck">
> 	<lst name="suggestions">
> 		<lst name="hopq">
> 			<int name="numFound">94</int>
> 			<int name="startOffset">7</int>
> 			<int name="endOffset">11</int>
> 			<arr name="suggestion">
> 				<str>hope</str>
> 				<str>how</str>
> 				<str>hope</str>
> 				<str>chops</str>
> 				<str>hoped</str>
> 				etc
> 			</arr>
> 		</lst>
> 		<lst name="faill">
> 			<int name="numFound">100</int>
> 			<int name="startOffset">16</int>
> 			<int name="endOffset">21</int>
> 			<arr name="suggestion">
> 				<str>fall</str>
> 				<str>fails</str>
> 				<str>fail</str>
> 				<str>fill</str>
> 				<str>faith</str>
> 				<str>all</str>
> 				etc
> 			</arr>
> 		</lst>
> 		<lst name="collation">
> 			<str name="collationQuery">Title:(how AND fails)</str>
> 			<int name="hits">2</int>
> 			<lst name="misspellingsAndCorrections">
> 				<str name="hopq">how</str>
> 				<str name="faill">fails</str>
> 			</lst>
> 		</lst>
> 		<lst name="collation">
> 			<str name="collationQuery">Title:(hope AND faith)</str>
> 			<int name="hits">2</int>
> 			<lst name="misspellingsAndCorrections">
> 				<str name="hopq">hope</str>
> 				<str name="faill">faith</str>
> 			</lst>
> 		</lst>
> 		<lst name="collation">
> 			<str name="collationQuery">Title:(chops AND all)</str>
> 			<int name="hits">1</int>
> 			<lst name="misspellingsAndCorrections">
> 				<str name="hopq">chops</str>
> 				<str name="faill">all</str>
> 			</lst>
> 		</lst>
> 	</lst>
> </lst>
> In addition, SolrJ is updated to include SpellCheckResponse.getCollatedResults(), which returns the expanded Collation format.  getCollatedResult(), which returns a single String, is retained for backwards compatibility.  Other APIs are unchanged and still work, provided that spellcheck.collateExtendedResult is false.
> This likely will not return valid results when using shards.  Supporting shards would require a more robust interaction with the index than what exists in SpellCheckCollator.collate().
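
The interplay between spellcheck.maxCollationTries and spellcheck.maxCollations described above can be sketched roughly as follows.  This is an illustrative Python sketch, not the logic in SpellCheckCollator.  The names find_collations and count_hits are hypothetical; count_hits stands in for re-running the corrected query (with the original fq params) against the index.  The order in which candidate combinations are tried is also an assumption.

```python
from itertools import islice, product


def find_collations(suggestions, count_hits, max_tries, max_collations):
    """Return up to max_collations verified collations.

    suggestions: list of (misspelling, [corrections]) pairs in query order.
    count_hits: hypothetical callable taking a {misspelling: correction}
        dict and returning the number of hits the corrected query finds.
    max_tries: upper bound on candidate combinations to test.
    """
    words = [word for word, _ in suggestions]
    # Assumption: candidates are tried in simple cartesian-product order.
    candidates = product(*(corrections for _, corrections in suggestions))
    found = []
    for combo in islice(candidates, max_tries):
        corrections = dict(zip(words, combo))
        hits = count_hits(corrections)
        if hits > 0:  # keep only collations that return results
            found.append((corrections, hits))
            if len(found) == max_collations:
                break
    return found


# Stand-in for the index: hit counts for a few corrected queries.
fake_index = {("hope", "faith"): 2, ("how", "fails"): 2, ("chops", "all"): 1}


def count_hits(corrections):
    return fake_index.get((corrections["hopq"], corrections["faill"]), 0)


suggestions = [
    ("hopq", ["hope", "how", "chops"]),
    ("faill", ["fall", "fails", "faith", "all"]),
]
verified = find_collations(suggestions, count_hits, max_tries=10, max_collations=2)
too_few_tries = find_collations(suggestions, count_hits, max_tries=2, max_collations=2)
```

With max_tries=2 the sketch gives up before reaching a combination that has hits, which illustrates why higher values may be needed to find a usable collation.  Unlike the patch, where 0 tries means collations are returned unverified, this sketch simply returns an empty list in that case.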
