Posted to solr-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2012/04/27 19:43:56 UTC

CJKBigram filter questions: single character queries, bigrams created across script/character types

I have a few questions about the CJKBigram filter.

About 10% of our queries that contain Han characters are single-character queries. It looks like the CJKBigram filter only outputs single characters when there are no adjacent bigrammable characters in the input. This means we would have to create a separate field to index Han unigrams in order to address single-character queries. Is this correct?

For Japanese, the default settings form bigrams across character types. So for a string containing Hiragana and Han characters, a bigram that mixes Hiragana and Han (“は革” below) is formed at the boundary:
いろは革命歌   =>    “いろ” “ろは” “は革” “革命” “命歌”

Is there a way to specify that you don’t want bigrams across character types?

Tom

Tom Burton-West
Digital Library Production Service
University of Michigan Library

http://www.hathitrust.org/blogs/large-scale-search
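
The cross-script behaviour described in the message above can be reproduced with a small sketch. This is not the Lucene CJKBigramFilter implementation; the helper names (`script_of`, `bigrams`, `bigrams_within_script`) are hypothetical, and classifying a character by the prefix of its Unicode name is an assumption made only for illustration. It contrasts plain overlapping bigrams with bigrams formed after first segmenting the string by character class.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Rough script class taken from the Unicode character name (illustrative only)."""
    name = unicodedata.name(ch, "")
    for script in ("HIRAGANA", "KATAKANA", "CJK", "HANGUL"):
        if name.startswith(script):
            return script
    return "OTHER"


def bigrams(text):
    """Overlapping character bigrams; a lone character is emitted as a unigram."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigrams_within_script(text):
    """Split into runs of one script first, then bigram each run separately."""
    tokens = []
    for _, run in groupby(text, key=script_of):
        tokens.extend(bigrams("".join(run)))
    return tokens


print(bigrams("いろは革命歌"))
print(bigrams_within_script("いろは革命歌"))
```

The first call yields the five bigrams listed in the message, including the mixed “は革”. The second yields only “いろ”, “ろは”, “革命” and “命歌”, because the Hiragana run and the Han run are bigrammed independently.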

Re: CJKBigram filter questions: single character queries, bigrams created across script/character types

Posted by Lance Norskog <go...@gmail.com>.

I have no experience with the language nuances. I've found that I had to
mix unigram phrase searches with free-text searches in bigram fields.
This is for Chinese, not Japanese. The bigram idea apparently comes
about because Chinese characters tend to clump into 2-3 character
"words", in a way that is not consistent across different kinds of
text. I make no pretense of understanding the whys.

On Mon, Apr 30, 2012 at 2:21 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Thanks wunder,
>
> I really appreciate the help.
>
> Tom
>

-- 
Lance Norskog
goksron@gmail.com

RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

Posted by "Burton-West, Tom" <tb...@umich.edu>.

Thanks wunder,

I really appreciate the help.

Tom

Re: CJKBigram filter questions: single character queries, bigrams created across script/character types

Posted by Walter Underwood <wu...@wunderwood.org>.

You'll see katakana used with kanji in noun compounds where one of the words is foreign.

In Japanese, "Rice University" is not written with the kanji word for "rice". They use katakana for "rice" and kanji for "university", like this: ライス大学.

This is very common. I expect that "President Obama" uses kanji for the title and katakana for "Obama".

Removing hiragana is a bad idea. There are some words that are only written in hiragana.

wunder

On Apr 30, 2012, at 1:27 PM, Burton-West, Tom wrote:

> Thanks wunder and Lance,
> 
> In the discussions I've seen of Japanese IR in the English language IR literature, Hiragana is either removed or strings are segmented first by character class.  I'm interested in finding out more about why bigramming across classes is desirable.
> Based on my limited understanding of Japanese, I can see how perhaps bigramming a Han and Hiragana character might make sense but what about Han and Katakana?
> 
> Lance, how did you weight the unigram vs bigram fields for CJK? or did you just OR them together assuming that idf will give the bigrams more weight?
> 
> Tom
>

RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

Posted by "Burton-West, Tom" <tb...@umich.edu>.

Thanks wunder and Lance,

In the discussions I've seen of Japanese IR in the English-language IR literature, Hiragana is either removed or strings are segmented first by character class. I'm interested in finding out more about why bigramming across classes is desirable.
Based on my limited understanding of Japanese, I can see how bigramming a Han and a Hiragana character might make sense, but what about Han and Katakana?

Lance, how did you weight the unigram vs. bigram fields for CJK? Or did you just OR them together, assuming that idf will give the bigrams more weight?

Tom

Re: CJKBigram filter questions: single character queries, bigrams created across script/character types

Posted by Lance Norskog <go...@gmail.com>.

This does not address the question. A single-ideogram query will not
find ideograms in the middle of phrases.

I have also found that phrase slop does not work with bigrams. At all.
I created a separate field type with unigrams. The CJK fields use the
StandardAnalyzer. I made a stack with just the SA, which gives raw Euro
text and single terms for CJK ideograms. This worked well for direct
phrase and phrase slop queries. You should use both kinds of fields:
the bigram search helps boost similar phrases.

You should also try the SmartChineseAnalyzer and the new Japanese analyzer
suite. I've discovered that CJK search is a very tricky thing, and
different use cases call for different strategies.
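
The point about single-ideogram queries and a separate unigram field can be illustrated with a toy inverted index. This is only a sketch under stated assumptions: it does not use Solr or Lucene, the function names (`bigram_tokens`, `unigram_tokens`, `build_index`) are hypothetical, and the two sample documents are made up from strings that appear elsewhere in this thread.

```python
def bigram_tokens(text):
    """Overlapping character bigrams; a lone character is emitted as a unigram."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def unigram_tokens(text):
    """One token per character."""
    return list(text)


def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index


docs = {1: "革命歌", 2: "大学"}
bigram_index = build_index(docs, bigram_tokens)
unigram_index = build_index(docs, unigram_tokens)

query = "命"  # a single-ideogram query
bigram_hits = bigram_index.get(query, set())
unigram_hits = unigram_index.get(query, set())
print(bigram_hits, unigram_hits)
```

The bigram index holds only “革命”, “命歌” and “大学”, so the single character “命” has no matching term there, while the unigram index finds it in document 1. That is the reason for keeping both kinds of fields.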

On Fri, Apr 27, 2012 at 10:57 AM, Walter Underwood
<wu...@wunderwood.org> wrote:
> Bigrams across character types seem like a useful thing, especially for indexing adjective and verb endings.
>
> An n-gram approach is always going to generate a lot of junk along with the gold. Tighten the rules and good stuff is missed, guaranteed. The only way to sort it out is to use a tokenizer with some linguistic rules.
>
> wunder
>
> On Apr 27, 2012, at 10:43 AM, Burton-West, Tom wrote:
>
>> I have a few questions about the CJKBigram filter.
>>
>> About 10% of our queries that contain Han characters are single character queries.   It looks like the CJKBigram filter only outputs single characters when there are no adjacent bigrammable characters in the input.   This means we would have to create a separate field to index Han unigrams in order to address single character queries.  Is this correct?
>>
>> For Japanese, the default settings form bigrams across character types.  So for a string containing Hiragana and Han characters bigrams containing a mixture of Hiragana and Han characters are formed:
>> いろは革命歌   =>    “いろ” “ろは” “は革” “革命” “命歌”
>>
>> Is there a way to specify that you don’t want bigrams across character types?
>>
>> Tom
>>
>> Tom Burton-West
>> Digital Library Production Service
>> University of Michigan Library
>>
>> http://www.hathitrust.org/blogs/large-scale-search
>>

-- 
Lance Norskog
goksron@gmail.com

Re: CJKBigram filter questions: single character queries, bigrams created across script/character types

Posted by Walter Underwood <wu...@wunderwood.org>.

Bigrams across character types seem like a useful thing, especially for indexing adjective and verb endings.

An n-gram approach is always going to generate a lot of junk along with the gold. Tighten the rules and good stuff is missed, guaranteed. The only way to sort it out is to use a tokenizer with some linguistic rules.

wunder

On Apr 27, 2012, at 10:43 AM, Burton-West, Tom wrote:

> I have a few questions about the CJKBigram filter.
> 
> About 10% of our queries that contain Han characters are single character queries.   It looks like the CJKBigram filter only outputs single characters when there are no adjacent bigrammable characters in the input.   This means we would have to create a separate field to index Han unigrams in order to address single character queries.  Is this correct?
> 
> For Japanese, the default settings form bigrams across character types.  So for a string containing Hiragana and Han characters bigrams containing a mixture of Hiragana and Han characters are formed:
> いろは革命歌   =>    “いろ” “ろは” “は革” “革命” “命歌”
> 
> Is there a way to specify that you don’t want bigrams across character types?
> 
> Tom
> 
> Tom Burton-West
> Digital Library Production Service
> University of Michigan Library
> 
> http://www.hathitrust.org/blogs/large-scale-search
>