Posted to dev@nutch.apache.org by Teruhiko Kurosaka <Ku...@basistech.com> on 2006/10/20 01:54:49 UTC

RE: I modify NutchAnalysis.jj and NutchDocumentTokenizer.java to let nutch support chinese word.

 
> From: heack [mailto:kongyang217@gmail.com] 
> Sent: 2006-9-13 7:03
> To: nutch-dev@lucene.apache.org
> Subject: I modify NutchAnalysis.jj and 
> NutchDocumentTokenizer.java to let nutch support chinese word.
> 
> After that I tested it, and I used Luke to inspect the index. The
> words are parsed the way I intended, but I get no search results when
> my keyword is Chinese rather than English.

Heack,
I had the same experience. My guess is that NutchAnalysis.jj needs
to be modified so that it does not break CJK words into individual
characters, that is, by getting rid of SIGRAM and making the CJK
characters part of LETTER. Is this what you did, and you still
didn't get the results you wanted?
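
To make the difference concrete, here is a rough standalone sketch of the
two behaviors. It is not Nutch code: the class CjkTokenSketch, its methods,
and the Unicode-script test for "CJK" are all made up for illustration, and
the real grammar defines its own character ranges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not part of Nutch.
public class CjkTokenSketch {

    // Assumption: approximate "CJK" with Unicode scripts. NutchAnalysis.jj
    // uses explicit character ranges instead.
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // cjkAsLetter == false: each CJK character becomes its own token
    //                       (the SIGRAM-like behavior).
    // cjkAsLetter == true:  CJK characters join ordinary letter runs
    //                       (CJK treated as part of LETTER).
    static List<String> tokenize(String text, boolean cjkAsLetter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp) && !cjkAsLetter) {
                flush(word, tokens);                          // end any pending word
                tokens.add(new String(Character.toChars(cp))); // single-character token
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
            } else {
                flush(word, tokens);                          // separator ends the word
            }
        }
        flush(word, tokens);
        return tokens;
    }

    // Emit the pending word, if any, and reset the buffer.
    static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        String text = "Nutch 中文搜索";
        System.out.println(tokenize(text, false)); // per-character CJK tokens
        System.out.println(tokenize(text, true));  // CJK kept together as a word
    }
}
```

If the indexing side emits whole words (the second mode) while the query
side still splits into single characters (the first mode), the query terms
would never match the indexed terms, which would fit the symptom described.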


-kuro