Posted to solr-dev@lucene.apache.org by "David Bowen (JIRA)" <ji...@apache.org> on 2009/09/04 07:52:57 UTC
[jira] Commented: (SOLR-1407) SpellingQueryConverter now disallows
underscores and digits in field names (but allows all UTF-8 letters)
[ https://issues.apache.org/jira/browse/SOLR-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751317#action_12751317 ]
David Bowen commented on SOLR-1407:
-----------------------------------
This is perhaps a separate issue, but I think this class should skip search terms containing wildcards, since spelling suggestions make no sense for a term that contains a wildcard. It should probably also skip terms with a fuzzy-match suffix (a trailing ~).
Also, it should skip NOT as well as AND and OR.
Something like this:
<pre>
// A match may not start where a field prefix (\w+:) or digits follow;
// the match itself consists of letters plus the characters ? * ~
protected Pattern QUERY_REGEX = Pattern.compile("(?:(?!(\\w+:|\\d+)))(\\p{L}|[?*~])+");
// Any of these characters marks the word as a wildcard or fuzzy term
protected Pattern WILD_OR_FUZZY = Pattern.compile("[?*~]");

/**
 * Converts the original query string to a collection of Lucene Tokens.
 * @param original the original query string
 * @return a Collection of Lucene Tokens
 */
@Override
public Collection<Token> convert(String original) {
  if (original == null) { // this can happen with q.alt = and no query
    return Collections.emptyList();
  }
  Collection<Token> result = new ArrayList<Token>();
  //TODO: Extract the words using a simple regex, but not query stuff, and then analyze them to produce the token stream
  Matcher matcher = QUERY_REGEX.matcher(original);
  TokenStream stream;
  while (matcher.find()) {
    String word = matcher.group(0);
    // Skip boolean operators and any word containing a wildcard or fuzzy marker
    if (!word.equals("AND") && !word.equals("OR") && !word.equals("NOT")
        && !WILD_OR_FUZZY.matcher(word).find())
    {
      try {
        stream = analyzer.reusableTokenStream("", new StringReader(word));
        Token token;
        while ((token = stream.next()) != null) {
          // Offsets refer to the position of the word in the original query
          token.setStartOffset(matcher.start());
          token.setEndOffset(matcher.end());
          result.add(token);
        }
      } catch (IOException e) {
        // IOException is swallowed; move on to the next match
      }
    }
  }
  return result;
}
</pre>
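To make the effect of the two patterns easier to check without a Solr build, here is a minimal standalone sketch. It reuses the regexes and the AND/OR/NOT check from the snippet above, but returns plain strings instead of Lucene Tokens and does no analysis. The class name SpellcheckTermFilter and the method extractTerms are hypothetical names made up for this illustration; they are not part of Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: applies the same regex and skip
// rules as the proposed convert() method, minus the Lucene analysis step.
class SpellcheckTermFilter {
    // Same patterns as in the snippet above.
    static final Pattern QUERY_REGEX =
        Pattern.compile("(?:(?!(\\w+:|\\d+)))(\\p{L}|[?*~])+");
    static final Pattern WILD_OR_FUZZY = Pattern.compile("[?*~]");

    // Returns the words that would be passed on to the spell checker.
    static List<String> extractTerms(String original) {
        List<String> result = new ArrayList<String>();
        if (original == null) {
            return result;
        }
        Matcher matcher = QUERY_REGEX.matcher(original);
        while (matcher.find()) {
            String word = matcher.group(0);
            boolean isOperator =
                word.equals("AND") || word.equals("OR") || word.equals("NOT");
            boolean hasWildcardOrFuzzy = WILD_OR_FUZZY.matcher(word).find();
            if (!isOperator && !hasWildcardOrFuzzy) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The field name (with underscore and digit), the operators, the
        // wildcard term and the fuzzy term are all dropped.
        System.out.println(extractTerms("title_1:helo AND wor*d NOT foo~"));
        // prints [helo]
    }
}
```

Note that the operator check is case-sensitive, as in the snippet above, so a lowercase "and" or "not" would still be treated as an ordinary word.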
> SpellingQueryConverter now disallows underscores and digits in field names (but allows all UTF-8 letters)
> ---------------------------------------------------------------------------------------------------------
>
> Key: SOLR-1407
> URL: https://issues.apache.org/jira/browse/SOLR-1407
> Project: Solr
> Issue Type: Improvement
> Components: spellchecker
> Affects Versions: 1.3
> Reporter: David Bowen
> Assignee: Shalin Shekhar Mangar
> Priority: Trivial
> Fix For: 1.4
>
>
> SpellingQueryConverter was extended to cover the full UTF-8 range instead of handling US-ASCII only, but in the process it was broken for field names that contain underscores or digits.