
Posted to dev@lucene.apache.org by Che Dong <ch...@hotmail.com> on 2002/09/13 03:43:11 UTC

about bigram based word segment

> I don't know any Asian languages, but from earlier experiments I
> remember that bigram tokenization can sometimes hurt matching, e.g.:
> 
> w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
> miss a search for w2. w1 w2 w3 would work better.
> 
If Chinese text is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
then searching for "w1w2" and "w2w1" will return the same results, won't it?


Bigram-based word segmentation, "w1w2w3" => "w1w2" "w2w3",
or even trigram-based word segmentation, "w1w2w3w4" => "w1w2w3" "w2w3w4",
avoids the character-order problem above.
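
To make the difference concrete, here is a minimal Python sketch (not Lucene code). The helper names ngrams and matches_all are hypothetical, and "matching" is simplified to an unordered AND over terms with no positions, which is an assumption made only for illustration.

```python
def ngrams(chars, n):
    """Return the overlapping n-grams of a character sequence as tuples."""
    return [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def matches_all(query_terms, doc_terms):
    """Unordered AND match: every query term must occur somewhere in the doc."""
    return set(query_terms) <= set(doc_terms)


# The placeholders w1, w2, w3 stand for single Chinese characters.
text = ["w1", "w2", "w3"]
doc_unigrams = ngrams(text, 1)
doc_bigrams = ngrams(text, 2)

# Unigram index: "w1w2" and "w2w1" both match, because order is lost.
unigram_forward = matches_all(ngrams(["w1", "w2"], 1), doc_unigrams)
unigram_reverse = matches_all(ngrams(["w2", "w1"], 1), doc_unigrams)

# Bigram index: only the original order matches.
bigram_forward = matches_all(ngrams(["w1", "w2"], 2), doc_bigrams)
bigram_reverse = matches_all(ngrams(["w2", "w1"], 2), doc_bigrams)

# The drawback raised in the quoted message: a single-character query
# for w2 has no corresponding term in a bigram-only index.
single_char_hit = ("w2",) in set(doc_bigrams)

print(unigram_forward, unigram_reverse)  # both orders match
print(bigram_forward, bigram_reverse)    # only the forward order matches
print(single_char_hit)                   # the single character is missed
```

The sketch ignores positions entirely, so it only illustrates the term-level behaviour discussed here, not phrase matching.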

According to the statistics, bigram-based word segmentation returned the best results, but it needs the QueryParser to combine query terms with an "and" relation by default.

You can try bigram-based word segmentation at http://search.163.com in the category search and news search (the web page search is powered by Google).
Google's Chinese language analysis is provided by basistech, using dictionary-based word segmentation.
http://www.basistech.com/products/language-analysis/cma.html



Che, Dong

Re: about bigram based word segment

Posted by Herman Chen <hc...@intumit.com>.

I think there's another flaw with the bigram approach when the query
consists of 3+ characters.  For example, a query of w1w2w3 is tokenized as
w1w2 and w2w3, so it would also match text such as w1w2w4w2w3, where both
bigrams occur but not next to each other.  Currently I do unigram
tokenization and perform automatic phrase queries for CJK searches, but
performance could take a hit in large-scale situations.
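
The false positive can be reproduced with a small Python sketch. It assumes the bigram terms are combined with a plain AND and no position check; the function names and_match and contains_run are hypothetical and are not part of Lucene.

```python
def bigrams(chars):
    """Overlapping bigrams of a character sequence, as tuples."""
    return [tuple(chars[i:i + 2]) for i in range(len(chars) - 1)]


def and_match(query, doc):
    """True if every bigram of the query occurs somewhere in the doc."""
    return set(bigrams(query)) <= set(bigrams(doc))


def contains_run(query, doc):
    """Ground truth: does the doc contain the query as a contiguous run?"""
    n = len(query)
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


query = ["w1", "w2", "w3"]
doc_exact = ["w1", "w2", "w3"]
doc_split = ["w1", "w2", "w4", "w2", "w3"]

# Both bigrams w1w2 and w2w3 occur in doc_split, so the AND match
# succeeds even though w1w2w3 never appears as a contiguous string.
print(and_match(query, doc_exact), contains_run(query, doc_exact))
print(and_match(query, doc_split), contains_run(query, doc_split))
```

For doc_split the AND match succeeds while the contiguous check fails, which is exactly the mismatch described above.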



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: about bigram based word segment

Posted by Alex Murzaku <mu...@yahoo.com>.

--- Che Dong <ch...@hotmail.com> wrote:
> If Chinese text is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
> then searching for "w1w2" and "w2w1" will return the same results, won't it?

That wouldn't be the case if you quote the two characters (and thereby
submit a "phrase query"). But this discussion would be more
appropriate in the user group... 
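
As a rough illustration of why a phrase query over single-character tokens keeps the order, here is a toy positional index in Python. It is a sketch under simplifying assumptions, not how Lucene implements phrase queries; build_positional_index and phrase_query are hypothetical names.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each single-character term to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, chars in docs.items():
        for pos, ch in enumerate(chars):
            index[ch][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return the doc ids in which the terms occur at consecutive positions."""
    if not phrase:
        return set()
    hits = set()
    for doc_id, starts in index.get(phrase[0], {}).items():
        for start in starts:
            # Term k of the phrase must sit exactly k positions after the start.
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(phrase)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "d1": ["w1", "w2", "w3"],
    "d2": ["w1", "w2", "w4", "w2", "w3"],
}
index = build_positional_index(docs)

print(phrase_query(index, ["w1", "w2"]))        # found in both documents
print(phrase_query(index, ["w2", "w1"]))        # reversed order: no hits
print(phrase_query(index, ["w1", "w2", "w3"]))  # only the contiguous document
```

Every phrase lookup has to compare position lists, which is where the extra cost of phrase queries on large indexes comes from.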

=====
__________________________________
alex@lissus.com -- http://www.lissus.com
