Posted to dev@nutch.apache.org by "Jack Tang (JIRA)" <ji...@apache.org> on 2005/10/05 18:19:47 UTC
[jira] Commented: (NUTCH-36) Chinese in Nutch

    [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12331394 ] 

Jack Tang commented on NUTCH-36:
--------------------------------

Kerang Lv's solution works well in NutchAnalysis, but there are still some bugs in Summarizer. Take a Chinese string (c1)(c2)(c3)(c4); the bi-gram tokenizer produces:
matched-image    start-offset    end-offset
(c1)(c2)         0               2
(c2)(c3)         1               3
(c3)(c4)         2               4
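
To make the offsets in the table concrete, here is a minimal sketch of overlapping bi-gram generation. It is not Nutch's tokenizer: the class BigramSketch and its Token type are hypothetical names used only for illustration, and the sketch assumes every character is a single Java char (no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: emits overlapping bi-grams with offsets.
class BigramSketch {

    /** A token's text plus its [start, end) character offsets in the input. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns every pair of adjacent characters; each pair overlaps the next by one character. */
    static List<Token> bigrams(String s) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            out.add(new Token(s.substring(i, i + 2), i, i + 2));
        }
        return out;
    }
}
```

For a four-character input this yields three tokens with offsets (0,2), (1,3) and (2,4), matching the table above.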

In search summaries, we should merge tokens whose offsets overlap. One way to do this:

Change the code

          if (highlight.contains(t.termText())) {
            excerpt.addToken(t.termText());
            excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
            excerpt.add(new Highlight(text.substring(t.startOffset(),t.endOffset())));
            offset = t.endOffset();
            endToken = Math.min(j+SUM_CONTEXT, tokens.length);
          }

to

          if (highlight.contains(t.termText())) {
            if (offset * 2 == (t.startOffset() + t.endOffset())) {
              // CJK bi-gram: offset (end of the previously added text) sits at this
              // token's midpoint, so its first character is already in the excerpt.
              // Add only the remaining part of the token.
              excerpt.addToken(t.termText().substring(offset - t.startOffset()));
              // For a two-character token this fragment is empty.
              excerpt.add(new Fragment(text.substring(t.startOffset() + 1, offset)));
              excerpt.add(new Highlight(text.substring(t.startOffset() + 1, t.endOffset())));
            } else {
              // Otherwise add the token as before.
              excerpt.addToken(t.termText());
              excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
              excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
            }
            offset = t.endOffset();
            endToken = Math.min(j+SUM_CONTEXT, tokens.length);
          }
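
The merging idea can also be sketched on its own, outside Summarizer. The following is not the patch above but a general interval merge, shown as an alternative under stated assumptions: HighlightMergeSketch, mergeOverlapping and render are hypothetical names, ranges are [start, end) pairs already sorted by start offset, and square brackets stand in for the highlight markup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Nutch code: merge overlapping match ranges before highlighting.
class HighlightMergeSketch {

    /** Merges [start, end) ranges that overlap. Assumes the input is sorted by start. */
    static List<int[]> mergeOverlapping(List<int[]> ranges) {
        List<int[]> merged = new ArrayList<>();
        for (int[] r : ranges) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] < last[1]) {
                // Overlaps the previous range: extend it instead of starting a new one.
                last[1] = Math.max(last[1], r[1]);
            } else {
                merged.add(new int[] { r[0], r[1] });
            }
        }
        return merged;
    }

    /** Returns the text with each merged range wrapped in square brackets. */
    static String render(String text, List<int[]> ranges) {
        StringBuilder sb = new StringBuilder();
        int offset = 0;
        for (int[] r : mergeOverlapping(ranges)) {
            sb.append(text, offset, r[0]);                        // plain fragment
            sb.append('[').append(text, r[0], r[1]).append(']');  // highlighted span
            offset = r[1];
        }
        sb.append(text.substring(offset));                        // trailing plain text
        return sb.toString();
    }
}
```

With the three bi-gram ranges (2,4), (3,5) and (4,6) over the text "xxABCDyy", render produces a single highlighted span around "ABCD" rather than three separate ones. Ranges that merely touch, such as (0,2) and (2,4), are left separate in this sketch.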


> Chinese in Nutch
> ----------------
>
>          Key: NUTCH-36
>          URL: http://issues.apache.org/jira/browse/NUTCH-36
>      Project: Nutch
>         Type: Improvement
>   Components: indexer, searcher
>  Environment: all
>     Reporter: Jack Tang
>     Priority: Minor
>  Attachments: 桌
>
> Nutch currently supports Chinese in a very simple way: NutchAnalysis segments CJK terms word by word.
> So if I search for the Chinese term 'FooBar' (two Chinese words: 'Foo' and 'Bar'), the web GUI highlights 'FooBar' as well as 'Foo' and 'Bar' on their own, whereas we expect Nutch to highlight only 'FooBar'.
