Posted to solr-dev@lucene.apache.org by "David Bowen (JIRA)" <ji...@apache.org> on 2009/09/04 07:52:57 UTC

[jira] Commented: (SOLR-1407) SpellingQueryConverter now disallows underscores and digits in field names (but allows all UTF-8 letters)

    [ https://issues.apache.org/jira/browse/SOLR-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751317#action_12751317 ] 

David Bowen commented on SOLR-1407:
-----------------------------------

This is perhaps a separate issue, but I think this class should skip search terms containing wildcards (? or *), since it doesn't make sense to offer spelling suggestions for a term containing a wildcard.  It should probably also skip terms with a fuzzy-match suffix (~).

Also, it should skip NOT as well as AND and OR.

Something like this:
<pre>
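    // Matches runs of letters plus the wildcard/fuzzy characters ?, * and ~.
    // The negative lookahead stops a match from starting inside a "field:" prefix
    // (\w+ also covers underscores and digits) or on a digit.
    protected Pattern QUERY_REGEX  = Pattern.compile("(?:(?!(\\w+:|\\d+)))(\\p{L}|[?*~])+");
    // Finds a wildcard (? or *) or fuzzy (~) character anywhere in a matched term.
    protected Pattern WILD_OR_FUZZY = Pattern.compile("[?*~]");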

    /**
     * Converts the original query string to a collection of Lucene Tokens.
     * @param original the original query string
     * @return a Collection of Lucene Tokens
     */
    @Override
    public Collection<Token> convert(String original) {
      if (original == null) { // this can happen with q.alt = and no query
        return Collections.emptyList();
      }
      Collection<Token> result = new ArrayList<Token>();
      //TODO: Extract the words using a simple regex, but not query stuff, and then analyze them to produce the token stream
      Matcher matcher = QUERY_REGEX.matcher(original);
      TokenStream stream;
      while (matcher.find()) {
        String word = matcher.group(0);
        if (!word.equals("AND") && !word.equals("OR") && !word.equals("NOT")
            && !WILD_OR_FUZZY.matcher(word).find())
        {
          try {
            stream = analyzer.reusableTokenStream("", new StringReader(word));
            Token token;
            while ((token = stream.next()) != null) {
              token.setStartOffset(matcher.start());
              token.setEndOffset(matcher.end());
              result.add(token);
            }
          } catch (IOException e) {
            // Deliberately ignored: on failure this word contributes no further tokens.
          }
        }
      }
      return result;
    }
</pre>
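
To make the effect of the two patterns easier to check, here is a minimal standalone sketch that runs only the regex and filtering steps, with no Lucene analyzer involved. The class name SpellingTermFilterSketch, the method extractTerms, and the sample query are made up for illustration and are not part of Solr; the two patterns are copied unchanged from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only; not part of Solr.
class SpellingTermFilterSketch {
    // Same two patterns as in the proposed change.
    static final Pattern QUERY_REGEX =
        Pattern.compile("(?:(?!(\\w+:|\\d+)))(\\p{L}|[?*~])+");
    static final Pattern WILD_OR_FUZZY = Pattern.compile("[?*~]");

    /** Returns the words that would be passed on to the analyzer. */
    static List<String> extractTerms(String original) {
        List<String> terms = new ArrayList<>();
        if (original == null) {
            return terms;
        }
        Matcher matcher = QUERY_REGEX.matcher(original);
        while (matcher.find()) {
            String word = matcher.group(0);
            boolean isOperator =
                word.equals("AND") || word.equals("OR") || word.equals("NOT");
            boolean hasWildOrFuzzy = WILD_OR_FUZZY.matcher(word).find();
            // Keep only plain words: no boolean operators, no wildcard or fuzzy terms.
            if (!isOperator && !hasWildOrFuzzy) {
                terms.add(word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The field name contains an underscore and a digit.
        System.out.println(extractTerms("field_1:helo AND wor*d NOT foo~0.8 bar"));
        // prints [helo, bar]
    }
}
```

In this sample the field prefix field_1: is skipped as a whole, AND and NOT are dropped as operators, and wor*d and foo~ are dropped because they contain a wildcard or fuzzy character, so only helo and bar remain as candidates for spelling suggestions.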


> SpellingQueryConverter now disallows underscores and digits in field names (but allows all UTF-8 letters)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1407
>                 URL: https://issues.apache.org/jira/browse/SOLR-1407
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>    Affects Versions: 1.3
>            Reporter: David Bowen
>            Assignee: Shalin Shekhar Mangar
>            Priority: Trivial
>             Fix For: 1.4
>
>
> SpellingQueryConverter was extended to cover the full UTF-8 range instead of handling US-ASCII only, but in the process it was broken for field names that contain underscores or digits.
