Posted to dev@lucene.apache.org by Che Dong <ch...@hotmail.com> on 2004/05/30 19:35:58 UTC

Bigram Co-occurrences will be the better way for Word Discrimination. Re: Will CJKAnalyzer be released with Lucene 1.4?

> I would be against such a move.  I think Lucene's core has too many 
> analyzers in it already, such as the German and Russian ones.  The core 
> could do without any of the concrete analyzers altogether, in my 
> opinion - but it is handy to have a few general purpose convenience 
> ones.
+1
> 
> What benefit, besides convenience, would there be in bringing 
> CJKAnalyzer into the core?  What about all the others in the sandbox?  
> If we bring one in, why not all of them?
But CJK text has no spaces between words to begin with, so bigram co-occurrences are the better basis for word discrimination.
For example: if the term C1C2 is segmented into the single characters C1 and C2, the results will also contain C2C1; but in Chinese, the words C1C2 and C2C1 may have entirely different meanings.
Compared to the unigram-based tokenization implemented in StandardTokenizer, bigram-based tokens return MUCH better results.
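
Here is a minimal sketch of the idea in plain Java (not the actual CJKTokenizer source; the class and method names are illustrative only): slide a two-character window over the text and emit each overlapping pair as one token.

import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Emit every overlapping pair of adjacent characters as one token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The same two characters in opposite orders yield different
        // bigram tokens, while a unigram tokenizer would emit the same
        // two single-character tokens for both strings.
        System.out.println(bigrams("\u4E2D\u56FD")); // [zhong-guo]
        System.out.println(bigrams("\u56FD\u4E2D")); // [guo-zhong]
    }
}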

Based on the feedback I have received on CJKTokenizer: 
for CJK users, the bigram-based CJKTokenizer is strongly recommended for better results.
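
For anyone who wants to see the bigram tokens for themselves, here is a hedged sketch assuming the sandbox package org.apache.lucene.analysis.cjk and the Lucene 1.x TokenStream API; the field name "contents" and the sample text are illustrative.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer; // sandbox, not core

public class ShowCJKTokens {
    public static void main(String[] args) throws IOException {
        CJKAnalyzer analyzer = new CJKAnalyzer();
        TokenStream stream = analyzer.tokenStream("contents",
                new StringReader("\u4E2D\u56FD\u4EBA")); // three CJK characters
        // In the Lucene 1.x API, next() returns null at the end of the stream.
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.termText()); // overlapping bigrams
        }
    }
}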

For more:

Word Discrimination Based on Bigram Co-occurrences (ElNasan and Nagy, ICDAR 2001)
www.ecse.rpi.edu/homepages/nagy/PDF_files/ElNasan-Nagy-ICDAR01.pdf

Segmenting Chinese in Unicode (Basis Technology, IUC-16)
www.basistech.com/papers/chinese/iuc-16-paper.pdf

> 
> It has been brought up to bring in the SnowballAnalyzer - as it 
> actually is general purpose and spans many languages.  I'm not really 
> for bringing that one in either.
> 
> I'm but one voice and would not veto bringing in other analyzers; I 
> just don't think there is much benefit, especially if we improve the 
> release process to incorporate the sandbox goodies into a single 
> distribution but as separate JARs.
> 
> Erik
Thank you, Erik. I hope we can have more discussion on this issue with other East Asian language users.

Che Dong
