Posted to solr-dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2009/06/06 07:31:08 UTC
[jira] Issue Comment Edited: (SOLR-1204) Enhance SpellingQueryConverter to handle UTF-8 instead of ASCII only

    [ https://issues.apache.org/jira/browse/SOLR-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716833#action_12716833 ] 

Shalin Shekhar Mangar edited comment on SOLR-1204 at 6/5/09 10:29 PM:
----------------------------------------------------------------------

{quote}
In order to produce a correct patch, I need to know what are legal field names. It can hardly be "any UTF-8 string" as that will also contain the colon, which is already used to delimit field names from query strings. What about digits? Asterisk? Dash (minus)? Underscore? Space? Tabulator?
{quote}

Lucene itself does not restrict field names. The special characters you list are limitations of our query parser syntax, not of Lucene. However, you are right that we need to look at this from Solr's point of view. Let us try to limit field names to valid Java identifiers, or the closest that we can get to them.
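
To make "valid Java identifiers" concrete, here is a minimal sketch of such a check using the JDK's own Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. The class and method names (FieldNameSketch, isJavaIdentifierLike) are hypothetical and exist only for illustration; this is not code from Solr or from the attached patch.

```java
public class FieldNameSketch {

    // Hypothetical helper: returns true if the name follows Java identifier
    // rules (keywords are not rejected, since they are harmless as field names).
    static boolean isJavaIdentifierLike(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        // Work on code points so characters outside the BMP are handled too.
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isJavaIdentifierLike("title"));       // true
        System.out.println(isJavaIdentifierLike("gr\u00f6\u00dfe")); // true: Unicode letters are allowed
        System.out.println(isJavaIdentifierLike("field:name"));  // false: colon is not an identifier character
        System.out.println(isJavaIdentifierLike("my-field"));    // false: dash is not an identifier character
    }
}
```

Under this rule the colon, asterisk, dash, space and tab are all rejected, while digits are accepted anywhere except the first position and the underscore is accepted everywhere. Note that the dollar sign also counts as an identifier character in Java, which may or may not be desirable for field names.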

was (Author: shalinmangar):
{quote}
In order to produce a correct patch, I need to know what are legal field names. It can hardly be "any UTF-8 string" as that will also contain the colon, which is already used to delimit field names from query strings. What about digits? Asterisk? Dash (minus)? Underscore? Space? Tabulator?
{quote}

Lucene does not limit the field names. Those special characters are actually limitations of our query parser syntax. However, you are right, we need to view them from Solr's point of view. Let us try to limit this to valid Java identifiers or the closes that we can get to them.
  
> Enhance SpellingQueryConverter to handle UTF-8 instead of ASCII only
> --------------------------------------------------------------------
>
>                 Key: SOLR-1204
>                 URL: https://issues.apache.org/jira/browse/SOLR-1204
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>    Affects Versions: 1.3
>            Reporter: Michael Ludwig
>            Assignee: Shalin Shekhar Mangar
>            Priority: Trivial
>             Fix For: 1.4
>
>         Attachments: SpellingQueryConverter.java.diff, SpellingQueryConverter.java.diff
>
>
> Solr - User - SpellCheckComponent: queryAnalyzerFieldType
> http://www.nabble.com/SpellCheckComponent%3A-queryAnalyzerFieldType-td23870668.html
> In the above thread, it was suggested to extend the SpellingQueryConverter to cover the full UTF-8 range instead of handling US-ASCII only. This might be as simple as changing the regular expression used to tokenize the input string to accept a sequence of one or more Unicode letters ( \p{L}+ ) instead of a sequence of one or more word characters ( \w+ ).
> See http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html for Java regular expression reference.
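
The difference between the two character classes mentioned in the description can be seen in a small standalone sketch. It is not the actual SpellingQueryConverter code, whose pattern also has to deal with field names and query syntax; the class name UnicodeTokenSketch and the sample input are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeTokenSketch {

    // By default, \w in java.util.regex matches only [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // \p{L} matches any Unicode letter (but neither digits nor the underscore).
    static final Pattern UNICODE_LETTERS = Pattern.compile("\\p{L}+");

    // Collect every match of the pattern in the input, in order.
    static List<String> tokenize(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "häuser straße", written with escapes to avoid source-encoding issues.
        String query = "h\u00e4user stra\u00dfe";
        System.out.println(tokenize(ASCII_WORD, query));      // [h, user, stra, e]
        System.out.println(tokenize(UNICODE_LETTERS, query)); // [häuser, straße]
    }
}
```

With \w+ the non-ASCII letters act as token boundaries, so each word is split into fragments; with \p{L}+ the words survive intact. One trade-off worth noting is that \p{L}+ no longer matches digits or underscores, so a term such as "solr14" would be cut down to "solr" unless the character class is widened accordingly.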
