Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2014/04/02 19:19:48 UTC

Analysis of Japanese characters

My company is setting up a system for a customer from Japan.  We have an 
existing system that handles primarily English.

Here's my general text analysis chain:

http://apaste.info/xa5

After talking to the customer about problems they are encountering with 
search, we have determined that some of the problems are caused because 
ICUTokenizer splits on *any* character set change, including changes 
between different Japanese character sets.

Knowing the risk of this being an XY problem, here's my question: Can 
someone help me develop a rule file for the ICU Tokenizer that will 
*not* split when the character set changes from one of the Japanese
character sets to another Japanese character set, but still split on
other character set changes?
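
For reference, the only hook I know of on the factory is its rulefiles
attribute, which takes per-script RBBI break rule files; whether custom
rules can actually suppress the split between script runs is exactly the
part I'm unsure about.  A rough sketch of the kind of thing I mean (the
file names are placeholders, not something we have):

 <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory"
                rulefiles="Hira:japanese.rbbi,Kana:japanese.rbbi,Hani:japanese.rbbi"/>
   </analyzer>
 </fieldType>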

Thanks,
Shawn


Re: Analysis of Japanese characters

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
No specific answers, but have you read the detailed CJK article
collection: http://discovery-grindstone.blogspot.ca/ . There is a lot
of information there.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Thu, Apr 3, 2014 at 12:19 AM, Shawn Heisey <so...@elyograg.org> wrote:
> My company is setting up a system for a customer from Japan.  We have an
> existing system that handles primarily English.
>
> Here's my general text analysis chain:
>
> http://apaste.info/xa5
>
> After talking to the customer about problems they are encountering with
> search, we have determined that some of the problems are caused because
> ICUTokenizer splits on *any* character set change, including changes between
> different Japanese character sets.
>
> Knowing the risk of this being an XY problem, here's my question: Can
> someone help me develop a rule file for the ICU Tokenizer that will *not*
> split when the character set changes from one of the Japanese character sets
> to another Japanese character set, but still split on other character set
> changes?
>
> Thanks,
> Shawn
>

Re: Analysis of Japanese characters

Posted by Shawn Heisey <so...@elyograg.org>.
On 4/7/2014 2:07 PM, T. Kuro Kurosaka wrote:
> Tom,
> You should be using JapaneseAnalyzer (kuromoji).
> Neither CJK nor ICU tokenize at word boundaries.

Is JapaneseAnalyzer configurable with regard to what it does with 
non-Japanese text?  If it's not, it won't work for me.
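
I do know the kuromoji tokenizer is also exposed on its own as
solr.JapaneseTokenizerFactory, so in principle it could be dropped into a
custom chain instead of taking the whole analyzer.  A minimal untested
sketch, with filter choices that are my guesses for illustration rather
than anything we run:

 <analyzer>
   <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
   <filter class="solr.JapaneseBaseFormFilterFactory"/>
   <filter class="solr.CJKWidthFilterFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>

The open question for us is still how that tokenizer would treat the
English text mixed into the same field.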

We use a combination of tokenizers and filters because there are no full 
analyzers that do what we require.  My analysis chain (for our index 
that's primarily english) has evolved over the last few years into its 
current form:

http://apaste.info/xa5

For our Japanese customer, we have recently changed from 
ICUFoldingFilter to ASCIIFoldingFilter and ICUNormalizer2Filter, because 
they do not want us to fold accent marks on Japanese characters.  I do 
not understand enough about Japanese to have an opinion on this, beyond 
the general "we should normalize EVERYTHING" approach.  The data from 
this customer is not purely Japanese - there is a lot of English as 
well, and quite possibly a small amount of other languages.
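
Roughly, the swap amounts to replacing the single ICU folding step with
two filters along these lines (a sketch from memory; the name/mode
arguments are my best guess, not a copy of the deployed schema):

 <filter class="solr.ASCIIFoldingFilterFactory"/>
 <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>

The idea is that ASCII folding still flattens Latin accents while the
normalizer handles Unicode normalization without stripping marks from
Japanese characters.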

Thanks,
Shawn


Re: Analysis of Japanese characters

Posted by "T. Kuro Kurosaka" <ku...@healthline.com>.
Tom,
You should be using JapaneseAnalyzer (kuromoji).
Neither CJK nor ICU tokenize at word boundaries.

On 04/02/2014 10:33 AM, Tom Burton-West wrote:
> Hi Shawn,
>
> I'm not sure I understand the problem and why you need to solve it at the
> ICUTokenizer level rather than the CJKBigramFilter
> Can you perhaps give a few examples of the problem?
>
> Have you looked at the flags for the CJKBigramFilter?
> You can tell it to make bigrams of different Japanese character sets.  For
> example the config given in the JavaDocs tells it to make bigrams across 3
> of the different Japanese character sets.  (Is the issue related to Romaji?)
>
>   <filter class="solr.CJKBigramFilterFactory"
>         han="true" hiragana="true"
>         katakana="true" hangul="true" outputUnigrams="false" />
>
>
>
> http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html
>
> Tom
>
>
> On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey <so...@elyograg.org> wrote:
>
>> My company is setting up a system for a customer from Japan.  We have an
>> existing system that handles primarily English.
>>
>> Here's my general text analysis chain:
>>
>> http://apaste.info/xa5
>>
>> After talking to the customer about problems they are encountering with
>> search, we have determined that some of the problems are caused because
>> ICUTokenizer splits on *any* character set change, including changes
>> between different Japanese character sets.
>>
>> Knowing the risk of this being an XY problem, here's my question: Can
>> someone help me develop a rule file for the ICU Tokenizer that will *not*
>> split when the character set changes from one of the Japanese character
>> sets to another Japanese character set, but still split on other character
>> set changes?
>>
>> Thanks,
>> Shawn
>>
>>


Re: Analysis of Japanese characters

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Shawn,

>> For an input of 田中角栄 the bigram filter works like you described, and what
>> I would expect.  If I add a space at the point where the ICU tokenizer
>> would have split them anyway, the bigram filter output is very different.

If I'm understanding what you are reporting, I suspect this is behavior as
designed.  My guess is that the bigram filter figures that if there was a
space in the original input (to the whole filter chain), it should not
create a bigram across it.
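For example, if that guess is right, 田中角栄 as a single run would come out
as the bigrams 田中 中角 角栄, while 田中 角栄 (with the space) would produce
only 田中 and 角栄, with no bigram bridging the gap.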

Tom

BTW: if you can show a few examples of Japanese queries that show the
original problem and the reason it's a problem (without of course showing
anything proprietary), I'd love to see them.  I'm always interested in
learning more about Japanese query processing.

Re: Analysis of Japanese characters

Posted by Shawn Heisey <so...@elyograg.org>.
On 4/2/2014 2:19 PM, Tom Burton-West wrote:
> Hi Shawn,
>
> I may still be missing your point.  Below is an example where the
> ICUTokenizer splits.
> Now, I'm beginning to wonder if I really understand what those flags on the
> CJKBigramFilter do.
> The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them
> back together into bigrams.
>
> I thought if you set han=true, hiragana=true
> you would get this kind of result where the third bigram is composed of a
> hiragana and han character

It looks like you are right.  I did not notice that the bigram filter 
was putting the tokens back together, even though the tokenizer was 
splitting them apart.  I might be worrying over nothing!  Thank you for 
taking some time to point out the obvious.

I did notice something odd, though.  Keep in mind that I have absolutely 
no idea what I am writing here, so I have no idea if this is valid at all:

For an input of 田中角栄 the bigram filter works like you described, and 
what I would expect.  If I add a space at the point where the ICU 
tokenizer would have split them anyway, the bigram filter output is very 
different.  Best guess: It notices that the end/start values from the 
original input are not consecutive, and therefore doesn't combine them.  
Like I said above, I may have nothing at all to worry about here.

Thanks,
Shawn


Re: Analysis of Japanese characters

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Shawn,

I may still be missing your point.  Below is an example where the
ICUTokenizer splits.
Now, I'm beginning to wonder if I really understand what those flags on the
CJKBigramFilter do.
The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them
back together into bigrams.

I thought if you set han=true, hiragana=true
you would get this kind of result where the third bigram is composed of a
hiragana and a han character:

いろは革命歌  =>  "いろ" "ろは" "は革" "革命" "命歌"

Hopefully the e-mail hasn't munged the output of the Solr analysis panel
below:

I can see this in our query processing where outputUnigrams=false:

org.apache.solr.analysis.ICUTokenizerFactory {luceneMatchVersion=LUCENE_36}
Splits into unigrams:
term text: い ろ は 革 命 歌

org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false,
outputUnigrams=false, katakana=false, han=true, hiragana=true,
luceneMatchVersion=LUCENE_36}
makes bigrams, including the middle one which is one hiragana character and
one han character:
term text: いろ ろは は革 革命 命歌

It appears that if you include outputUnigrams=true (as we both do in the
indexing configuration), this doesn't happen:

org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false,
outputUnigrams=true, katakana=false, han=true, hiragana=true,
luceneMatchVersion=LUCENE_36}
term text: い ろ は 革 命 歌 革命 命歌
type: <HIRAGANA> <HIRAGANA> <HIRAGANA> <SINGLE> <SINGLE> <SINGLE> <DOUBLE> <DOUBLE>

Not sure what happens for katakana as the ICUTokenizer doesn't convert it
to unigrams and our configuration is set to katakana=false.   I'll play
around on the test machine when I have time.
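
For reference, the index/query split we're both describing would look
roughly like this (a sketch using the flag values from the examples above,
not our actual schema):

 <analyzer type="index">
   <tokenizer class="solr.ICUTokenizerFactory"/>
   <filter class="solr.CJKBigramFilterFactory"
         han="true" hiragana="true"
         katakana="false" hangul="false" outputUnigrams="true"/>
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.ICUTokenizerFactory"/>
   <filter class="solr.CJKBigramFilterFactory"
         han="true" hiragana="true"
         katakana="false" hangul="false" outputUnigrams="false"/>
 </analyzer>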

Tom

Re: Analysis of Japanese characters

Posted by Shawn Heisey <so...@elyograg.org>.
On 4/2/2014 11:33 AM, Tom Burton-West wrote:
> Hi Shawn,
>
> I'm not sure I understand the problem and why you need to solve it at the
> ICUTokenizer level rather than the CJKBigramFilter
> Can you perhaps give a few examples of the problem?
>
> Have you looked at the flags for the CJKBigramFilter?
> You can tell it to make bigrams of different Japanese character sets.  For
> example the config given in the JavaDocs tells it to make bigrams across 3
> of the different Japanese character sets.  (Is the issue related to Romaji?)
>
>   <filter class="solr.CJKBigramFilterFactory"
>         han="true" hiragana="true"
>         katakana="true" hangul="true" outputUnigrams="false" />
>
>
>
> http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html
>
> Tom
>
>
> On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey <so...@elyograg.org> wrote:
>
>> My company is setting up a system for a customer from Japan.  We have an
>> existing system that handles primarily English.
>>
>> Here's my general text analysis chain:
>>
>> http://apaste.info/xa5
>>
>> After talking to the customer about problems they are encountering with
>> search, we have determined that some of the problems are caused because
>> ICUTokenizer splits on *any* character set change, including changes
>> between different Japanese character sets.
>>
>> Knowing the risk of this being an XY problem, here's my question: Can
>> someone help me develop a rule file for the ICU Tokenizer that will *not*
>> split when the character set changes from one of the Japanese character
>> sets to another Japanese character set, but still split on other character
>> set changes?

Because of what ICUTokenizer does, by the time the text makes it to the
bigram filter, the pieces are already separate terms.

Simplifying to English, let's pretend that upper and lowercase letters
are in different character sets.  Original term is abCD.  You expect 
that by the end of the analysis, you'll have ab bC CD.  With the 
ICUTokenizer, you end up with just ab CD.

The index side is more complex because of outputUnigrams.  We are still 
deciding whether we want to keep that parameter set, but that's a 
separate issue, one that we know how to resolve without help.

Thanks,
Shawn


Re: Analysis of Japanese characters

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Shawn,

I'm not sure I understand the problem and why you need to solve it at the
ICUTokenizer level rather than the CJKBigramFilter.
Can you perhaps give a few examples of the problem?

Have you looked at the flags for the CJKBigramFilter?
You can tell it to make bigrams of different Japanese character sets.  For
example the config given in the JavaDocs tells it to make bigrams across 3
of the different Japanese character sets.  (Is the issue related to Romaji?)

 <filter class="solr.CJKBigramFilterFactory"
       han="true" hiragana="true"
       katakana="true" hangul="true" outputUnigrams="false" />



http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html

Tom


On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey <so...@elyograg.org> wrote:

> My company is setting up a system for a customer from Japan.  We have an
> existing system that handles primarily English.
>
> Here's my general text analysis chain:
>
> http://apaste.info/xa5
>
> After talking to the customer about problems they are encountering with
> search, we have determined that some of the problems are caused because
> ICUTokenizer splits on *any* character set change, including changes
> between different Japanese character sets.
>
> Knowing the risk of this being an XY problem, here's my question: Can
> someone help me develop a rule file for the ICU Tokenizer that will *not*
> split when the character set changes from one of the Japanese character
> sets to another Japanese character set, but still split on other character
> set changes?
>
> Thanks,
> Shawn
>
>