Posted to dev@nutch.apache.org by "Jack Tang (JIRA)" <ji...@apache.org> on 2005/10/05 18:19:47 UTC
[jira] Commented: (NUTCH-36) Chinese in Nutch
[ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12331394 ]
Jack Tang commented on NUTCH-36:
--------------------------------
Kerang Lv's solution works well in NutchAnalysis, but there are still some bugs in Summarizer. Take a Chinese string (c1)(c2)(c3)(c4); the result of bi-gram tokenization is:
matched-image   start-offset   end-offset
(c1)(c2)        0              2
(c2)(c3)        1              3
(c3)(c4)        2              4
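As an illustration of where these offsets come from, here is a minimal standalone sketch of bi-gram tokenization. It is not the NutchAnalysis code: the class BigramOffsets and its Token holder are hypothetical names, and it assumes each character is a single Java char (no surrogate pairs). The ASCII string "abcd" stands in for (c1)(c2)(c3)(c4).

```java
import java.util.ArrayList;
import java.util.List;

public class BigramOffsets {

    /** Hypothetical token holder: text plus [start, end) character offsets. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + " " + start + " " + end;
        }
    }

    /** Emits every pair of adjacent characters as one token. */
    static List<Token> bigrams(String s) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            out.add(new Token(s.substring(i, i + 2), i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // "abcd" stands in for the four Chinese characters.
        for (Token t : bigrams("abcd")) {
            System.out.println(t);
        }
    }
}
```

Each token starts one character after the previous one but is two characters long, so consecutive tokens always share one character.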
In search summaries, we should merge tokens whose offsets overlap. You can do it like this:
Change this code
if (highlight.contains(t.termText())) {
  excerpt.addToken(t.termText());
  excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
  excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
  offset = t.endOffset();
  endToken = Math.min(j + SUM_CONTEXT, tokens.length);
}
to
if (highlight.contains(t.termText())) {
  if (offset * 2 == (t.startOffset() + t.endOffset())) { // cjk bi-gram
    excerpt.addToken(t.termText().substring(offset - t.startOffset()));
    excerpt.add(new Fragment(text.substring(t.startOffset() + 1, offset)));
    excerpt.add(new Highlight(text.substring(t.startOffset() + 1, t.endOffset())));
  }
  else {
    excerpt.addToken(t.termText());
    excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
    excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
  }
  offset = t.endOffset();
  endToken = Math.min(j + SUM_CONTEXT, tokens.length);
}
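To show the merging idea on its own, outside Summarizer, here is a rough sketch. It generalizes the patch from bi-grams to any overlapping spans, so it is not equivalent to the code above. The class OverlapMerge and the method highlight are hypothetical names, not part of Nutch. It assumes the matched spans are sorted by start offset and lie within the text, and it marks highlights with square brackets instead of building Fragment and Highlight objects.

```java
public class OverlapMerge {

    /**
     * Returns text with matched spans wrapped in [ ].
     * Spans are {start, end} pairs, sorted by start; spans that
     * overlap the previous one are merged into a single highlight.
     */
    static String highlight(String text, int[][] spans) {
        StringBuilder out = new StringBuilder();
        int offset = 0; // end of the text already emitted
        int i = 0;
        while (i < spans.length) {
            int start = spans[i][0];
            int end = spans[i][1];
            // Absorb following spans that begin before the current one ends.
            while (i + 1 < spans.length && spans[i + 1][0] < end) {
                end = Math.max(end, spans[i + 1][1]);
                i++;
            }
            out.append(text, offset, start);                      // plain fragment
            out.append('[').append(text, start, end).append(']'); // merged highlight
            offset = end;
            i++;
        }
        out.append(text.substring(offset)); // trailing plain text
        return out.toString();
    }

    public static void main(String[] args) {
        // "abcd" stands in for (c1)(c2)(c3)(c4); the spans are its three bi-grams.
        int[][] bigramSpans = { {1, 3}, {2, 4}, {3, 5} };
        System.out.println(highlight("xabcdy", bigramSpans));
    }
}
```

With the three overlapping bi-gram spans, the whole run is emitted as one highlight rather than three, and no character is written twice. Spans that merely touch (one ends where the next starts) are kept separate here; that is a design choice for the sketch, not something the patch above specifies.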
> Chinese in Nutch
> ----------------
>
> Key: NUTCH-36
> URL: http://issues.apache.org/jira/browse/NUTCH-36
> Project: Nutch
> Type: Improvement
> Components: indexer, searcher
> Environment: all
> Reporter: Jack Tang
> Priority: Minor
> Attachments: 桌
>
> Nutch currently supports Chinese in a very simple way: NutchAnalysis segments CJK terms word by word.
> So if I search for the Chinese term 'FooBar' (two Chinese words, 'Foo' and 'Bar'), the results in the web GUI highlight 'FooBar' as well as 'Foo' and 'Bar' individually, whereas we expect Nutch to highlight only 'FooBar'.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira