Posted to java-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2011/02/04 18:46:54 UTC

Bigrams for CJK with ICUTokenizer ?

Hello all,

We are using the ICUTokenizer because we have documents in about 400 different languages. We are also setting autoGeneratePhraseQueries to false so that text in CJK and other languages that don't use spaces to separate words won't be split into tokens by the ICUTokenizer and then have those tokens automatically searched as a phrase.

 The ICUTokenizer emits unigrams for Chinese (HAN). We would prefer to use overlapping bigrams as in the CJKAnalyzer.   Is it possible to configure the ICUTokenizer to emit overlapping bigrams?

Alternatively, is there some way to put some filter in the filter chain after the ICUTokenizer that would produce overlapping bigrams for CJK?

Tom Burton-West

RE: Bigrams for CJK with ICUTokenizer ?

Posted by "Burton-West, Tom" <tb...@umich.edu>.

Thanks Robert,

I opened LUCENE-2906. But I just realized that, in the effort to keep the description short, I forgot to include your option of producing both unigrams and bigrams, which is a nice option.

Tom

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Bigrams for CJK with ICUTokenizer ?

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Feb 4, 2011 at 3:07 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Thanks Robert,
>
> LUCENE-2470 looks really interesting.  In the meantime a JIRA issue for this sounds like a good idea since I'm guessing other people would like to use the ICUTokenizer but would also like bigrams for CJK.
>
> I'm a bit confused over the relationship of the queryparser to the filter chain and whether a filter in the chain after the ICUTokenizer could construct bigrams if the ICUTokenizer is spitting out unigrams and the queryparser is then converting the unigrams to Boolean clauses (i.e. autoGeneratePhraseQueries=false).

The QP only sees two things:
1. the input string, which it parses before the analyzer runs
2. the result of the entire analyzer (tokenizer and all filters).

So in this case, only #2 would be different, as the entire analyzer
would output AB, BC instead of A, B, C.
With your settings, for an input of ABC, you will get a regular
boolean query with AB, BC.
If the user puts "ABC" in quotes, though, you will get a phrase query of "AB BC".
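
To make the two cases above concrete, here is a minimal sketch in plain Java. It does not use Lucene's API: the class QueryParserSketch, the methods analyze and toQuery, and the string form of the result are all hypothetical, and analyze is only a stand-in for a whole analyzer that emits overlapping bigrams.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: shows that the query parser only
// looks at (1) the raw input and (2) the final output of the analyzer.
class QueryParserSketch {

    // Stand-in for the entire analysis chain (tokenizer plus filters).
    // Assumption: it emits overlapping bigrams over code points,
    // and passes a single character through unchanged.
    static List<String> analyze(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        if (cps.length == 1) {
            out.add(text);
            return out;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    // Step 1: inspect the input string (is it quoted?).
    // Step 2: run the analyzer and build a query from whatever it returns.
    static String toQuery(String userInput) {
        boolean quoted = userInput.length() >= 2
                && userInput.startsWith("\"")
                && userInput.endsWith("\"");
        String text = quoted
                ? userInput.substring(1, userInput.length() - 1)
                : userInput;
        List<String> tokens = analyze(text);
        if (quoted) {
            // Explicit quotes: the tokens become one phrase.
            return "PhraseQuery(\"" + String.join(" ", tokens) + "\")";
        }
        // autoGeneratePhraseQueries=false: one boolean clause per token.
        return "BooleanQuery(" + String.join(" ", tokens) + ")";
    }
}
```

With this sketch, toQuery("ABC") returns BooleanQuery(AB BC), and the quoted input "ABC" returns PhraseQuery("AB BC"). How the boolean clauses are combined (AND or OR) depends on the parser's default operator and is left out here.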

>
> If ABC is a string of Han characters and the ICUTokenizer spit out unigrams A B C  (and we have autoGeneratePhraseQueries set to false) won't the next filter in the chain get each of the unigrams in a Boolean clause one at a time?  I guess I don't see how the next filter in the chain can reassemble the unigrams into overlapping bigrams.   Maybe I'm not understanding how tokens get passed from one filter to the next when one of the filters (or in this case the tokenizer) breaks a token up into multiple tokens.

In this case it works just like a selective shinglefilter?

>
> Or am I getting index time analysis confused with query time analysis?
> Did you mean that ICUTokenizer could be modified to output bigrams  or that a filter could be designed that would take the output of the ICUTokenizer and create shingles on tokens with the attribute for Han?
>

I think the latter. This way, we can provide the most options: unigram
(what it does by default: A, B, C), but also filters for bigram (AB, BC)
or unibigram (A, AB, B, BC, C).
This is why I said we can make these filters experimental for now:
ideally, at some point you will be able to use ShingleFilter
"conditionally" over the ScriptAttribute for these use cases, without
needing a special filter.
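
The three output modes can be sketched in a few lines of plain Java. This is only an illustration: ShingleModes, Mode and shingle are hypothetical names, not an existing Lucene filter, and emitting a lone token as a unigram in bigram mode is an assumption rather than something stated above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the unigram / bigram / unibigram options.
class ShingleModes {

    enum Mode { UNIGRAM, BIGRAM, UNIBIGRAM }

    // Input: the unigrams the tokenizer already produced, in order.
    static List<String> shingle(List<String> unigrams, Mode mode) {
        List<String> out = new ArrayList<>();
        // Assumption: a lone token has no neighbour to pair with,
        // so bigram mode passes it through instead of dropping it.
        if (mode == Mode.BIGRAM && unigrams.size() == 1) {
            out.add(unigrams.get(0));
            return out;
        }
        for (int i = 0; i < unigrams.size(); i++) {
            if (mode != Mode.BIGRAM) {
                out.add(unigrams.get(i));                       // A, B, C
            }
            if (mode != Mode.UNIGRAM && i + 1 < unigrams.size()) {
                out.add(unigrams.get(i) + unigrams.get(i + 1)); // AB, BC
            }
        }
        return out;
    }
}
```

For the input A, B, C this yields A, B, C in UNIGRAM mode, AB, BC in BIGRAM mode, and A, AB, B, BC, C in UNIBIGRAM mode.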

RE: Bigrams for CJK with ICUTokenizer ?

Posted by "Burton-West, Tom" <tb...@umich.edu>.

Thanks Robert,

LUCENE-2470 looks really interesting.  In the meantime a JIRA issue for this sounds like a good idea since I'm guessing other people would like to use the ICUTokenizer but would also like bigrams for CJK.

I'm a bit confused over the relationship of the queryparser to the filter chain and whether a filter in the chain after the ICUTokenizer could construct bigrams if the ICUTokenizer is spitting out unigrams and the queryparser is then converting the unigrams to Boolean clauses (i.e. autoGeneratePhraseQueries=false).

If ABC is a string of Han characters and the ICUTokenizer spit out unigrams A B C  (and we have autoGeneratePhraseQueries set to false) won't the next filter in the chain get each of the unigrams in a Boolean clause one at a time?  I guess I don't see how the next filter in the chain can reassemble the unigrams into overlapping bigrams.   Maybe I'm not understanding how tokens get passed from one filter to the next when one of the filters (or in this case the tokenizer) breaks a token up into multiple tokens.

Or am I getting index time analysis confused with query time analysis?

Did you mean that ICUTokenizer could be modified to output bigrams  or that a filter could be designed that would take the output of the ICUTokenizer and create shingles on tokens with the attribute for Han?

Tom

Re: Bigrams for CJK with ICUTokenizer ?

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Feb 4, 2011 at 12:46 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hello all,
>
> We are using the ICUTokenizer because we have documents in about 400 different languages. We are also setting autoGeneratePhraseQueries to false so that text in CJK and other languages that don't use spaces to separate words won't be split into tokens by the ICUTokenizer and then have those tokens automatically searched as a phrase.
>
>  The ICUTokenizer emits unigrams for Chinese (HAN). We would prefer to use overlapping bigrams as in the CJKAnalyzer.   Is it possible to configure the ICUTokenizer to emit overlapping bigrams?
>
> Alternatively, is there some way to put some filter in the filter chain after the ICUTokenizer that would produce overlapping bigrams for CJK?
>

Hi Tom, let's open a JIRA issue for this; we can add it.
The gist of it is that ICUTokenizer sets a ScriptAttribute (an
integer) per token indicating its writing system.
So it's easy to make an efficient filter that basically only "shingles"
on this attribute.
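
As a rough illustration of shingling only on the script, here is a sketch in plain Java. It is not the ICU or Lucene API: the Token class and bigramHan method are hypothetical, and the JDK's Character.UnicodeScript stands in for the integer script code carried by ScriptAttribute.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: bigram only runs of Han tokens, pass the rest through.
class ScriptShingleSketch {

    // Stand-in for a token plus its script attribute.
    // Assumption: text is non-empty and its first code point decides the script.
    static final class Token {
        final String text;
        final Character.UnicodeScript script;

        Token(String text) {
            this.text = text;
            this.script = Character.UnicodeScript.of(text.codePointAt(0));
        }
    }

    static List<String> bigramHan(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token t = tokens.get(i);
            if (t.script != Character.UnicodeScript.HAN) {
                out.add(t.text);          // other scripts are left untouched
                i++;
                continue;
            }
            // Find the end of this run of consecutive Han tokens.
            int end = i;
            while (end < tokens.size()
                    && tokens.get(end).script == Character.UnicodeScript.HAN) {
                end++;
            }
            if (end - i == 1) {
                out.add(t.text);          // a lone Han token stays a unigram
            }
            for (int j = i; j + 1 < end; j++) {
                out.add(tokens.get(j).text + tokens.get(j + 1).text);
            }
            i = end;
        }
        return out;
    }
}
```

A real filter would probably also need to check token offsets, so that characters separated by punctuation or whitespace are not joined into a bigram; this sketch only looks at token order.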

The reason there isn't one is that I'd really like for us to
eventually solve this with
https://issues.apache.org/jira/browse/LUCENE-2470

But for now, I think it would be good to be practical and add the
explicit filter (we can just mark the API experimental, hoping we will
make it more general with 2470) so people can easily get good
out-of-the-box performance in situations like yours.
