Posted to solr-dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2009/05/05 23:39:36 UTC

[jira] Commented: (SOLR-1078) WordDelimiterFilter do wrong word breaking for Thai vowel

    [ https://issues.apache.org/jira/browse/SOLR-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706211#action_12706211 ] 

Yonik Seeley commented on SOLR-1078:
------------------------------------

Are these characters all in the basic multilingual plane?

Here is the relevant code showing how WordDelimiterFilter classifies characters:

{code}
  [...]
    } else if (Character.isLowerCase(ch)) {
      return LOWER;
    } else if (Character.isLetter(ch)) {
      return UPPER;
    } else {
      return SUBWORD_DELIM;
    }
{code}
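
To make the question concrete, here is a minimal, self-contained sketch that runs the three branches quoted above over the code units of the first word in the report (U+0E1C, U+0E39, U+0E49). The class name, the constant values, and the helper names are made up for illustration, and the elided "[...]" branches are deliberately not reproduced, so this is not the filter's actual code. It also prints whether each char is a surrogate, which is one way to check whether the text stays inside the basic multilingual plane.

```java
public class ThaiCharTypeDemo {
    // Placeholder values for this sketch only; not the filter's real constants.
    static final int LOWER = 1;
    static final int UPPER = 2;
    static final int SUBWORD_DELIM = 3;

    // Mirrors only the three branches quoted in the comment above.
    // Whatever the elided "[...]" branches do is NOT reproduced here.
    static int charType(char ch) {
        if (Character.isLowerCase(ch)) {
            return LOWER;
        } else if (Character.isLetter(ch)) {
            return UPPER;
        } else {
            return SUBWORD_DELIM;
        }
    }

    // Human-readable label for a type constant.
    static String typeName(int type) {
        switch (type) {
            case LOWER: return "LOWER";
            case UPPER: return "UPPER";
            default: return "SUBWORD_DELIM";
        }
    }

    // A char that is not a surrogate encodes a BMP code point by itself.
    static boolean isBmpChar(char ch) {
        return !Character.isSurrogate(ch);
    }

    public static void main(String[] args) {
        // First word from the report: consonant U+0E1C followed by the
        // combining marks U+0E39 and U+0E49 (written as escapes to avoid
        // source-encoding problems).
        String word = "\u0E1C\u0E39\u0E49";
        for (int i = 0; i < word.length(); i++) {
            char ch = word.charAt(i);
            System.out.printf("U+%04X isLetter=%b type=%s bmp=%b%n",
                    (int) ch, Character.isLetter(ch),
                    typeName(charType(ch)), isBmpChar(ch));
        }
    }
}
```

Under these three branches the consonant comes out as UPPER, since it is a letter without case, while the two combining marks are not letters according to Character.isLetter and so fall through to SUBWORD_DELIM. That matches the reported output, where only "ผ" survives from the first word.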



> WordDelimiterFilter do wrong word breaking for Thai vowel
> ---------------------------------------------------------
>
>                 Key: SOLR-1078
>                 URL: https://issues.apache.org/jira/browse/SOLR-1078
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>         Environment: Ubuntu 8.10 64bit
> Java 1.6.0_10
>            Reporter: SIriwat Aumngamsup
>
> With any schema.xml configuration that includes
> {code:xml}<filter class="solr.WordDelimiterFilterFactory" />{code}
> Thai text is split at the wrong positions.
> ----
> Example: "ผู้ ใหญ่ บ้าน"
> Wrong result: 0 => "ผ", 1 => "ใหญ", 2 => "บ", 3 => "าน"
> Expected result: 0 => "ผู้", 1 => "ใหญ่", 2 => "บ้าน"
> ----
> Example2: "ผู้ใหญ่บ้าน" (no space)
> Wrong result: 0 => "ผ", 1 => "ใหญ", 2 => "บ", 3 => "าน" (same result)
> Expected result: 0 => "ผู้ใหญ่บ้าน"
> ----
> There's a similar problem with Drupal (http://drupal.org/node/335928)
