Posted to solr-user@lucene.apache.org by Tomás Fernández Löbbe <to...@gmail.com> on 2011/03/11 16:29:30 UTC

Multiple Japanese Alphabets in Solr

This question is probably not strictly a Solr question, but it's related.
I'm working on a Japanese Solr application in which I would like to be able
to search in any of the Japanese alphabets. The content can also be in any
Japanese alphabet. I've been thinking of this solution: convert everything
to romaji, at index time and at query time.
For example:

Indexing time:
[Something in Hiragana] --> transliterate to romaji --> index

Searching time:
[Something in Katakana] --> transliterate to romaji --> search
or
[Something in Kanji] --> transliterate to romaji --> search
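
For concreteness, a minimal sketch of the conversion step I have in mind,
assuming ICU4J's Transliterator as the tool (I haven't settled on one, and
the class name here is just for illustration); the same transform would have
to run in both the index-time and the query-time analysis chains:

    import com.ibm.icu.text.Transliterator;

    public class RomajiDemo {
        public static void main(String[] args) {
            // Fold both kana syllabaries down to lowercase Latin.
            Transliterator toRomaji =
                    Transliterator.getInstance("Hiragana-Latin; Katakana-Latin; Lower");

            System.out.println(toRomaji.transliterate("とよた")); // toyota
            System.out.println(toRomaji.transliterate("トヨタ")); // toyota
        }
    }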

I don't have a deep understanding of Japanese, and that's my problem. Has
anybody on the list tried something like this before? Did it work?


Thanks,

Tomás

Re: Multiple Japanese Alphabets in Solr

Posted by François Schiettecatte <fs...@gmail.com>.
You could certainly do it that way if you wanted. 

The one point I would make here is that, from a linguistic POV, these are not synonyms but the same term written in a different alphabet.

François

On Mar 11, 2011, at 12:51 PM, Walter Underwood wrote:

> Sounds more like generating synonyms than conflating everything to one set of kana.
> 
> Why not a filter that does that transliteration and adds a token at the same position?
> 
> wunder


Re: Multiple Japanese Alphabets in Solr

Posted by Walter Underwood <wu...@wunderwood.org>.
Sounds more like generating synonyms than conflating everything to one set of kana.

Why not a filter that does that transliteration and adds a token at the same position?
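
Roughly like this; an untested sketch, with a made-up class name, using
ICU4J's Transliterator for the kana folding:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import com.ibm.icu.text.Transliterator;

    /** Emits a katakana-folded copy of each token at the same position,
     *  so both written forms match without discarding the original. */
    public final class KanaVariantFilter extends TokenFilter {
        private final Transliterator toKatakana =
                Transliterator.getInstance("Hiragana-Katakana");
        private final CharTermAttribute termAtt =
                addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncrAtt =
                addAttribute(PositionIncrementAttribute.class);
        private String pendingVariant; // folded form queued for emission

        public KanaVariantFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (pendingVariant != null) {
                // Emit the variant stacked on the previous token's position.
                termAtt.setEmpty().append(pendingVariant);
                posIncrAtt.setPositionIncrement(0);
                pendingVariant = null;
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String original = termAtt.toString();
            String variant = toKatakana.transliterate(original);
            if (!variant.equals(original)) {
                pendingVariant = variant; // queue only genuine variants
            }
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pendingVariant = null;
        }
    }

The same pattern would stack a romaji variant instead, or in addition.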

wunder

--
Walter Underwood
Venture ASM, Troop 14, Palo Alto




Re: Multiple Japanese Alphabets in Solr

Posted by François Schiettecatte <fs...@gmail.com>.
Tomás

The ICU code base is used by a *lot* of software, so I think it is safe to say that it works OK :)

François



Re: Multiple Japanese Alphabets in Solr

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
"the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ'
or 'とよた' (Katakana and Hiragana respectively), not doing the transliteration
will miss results."
Exactly, that's my problem, searching on a different alphabet than the one
on which it was indexed a document.
François, thank you for your help. Have you used the new ICU Filters? Do
they work OK? (I know it doesn't do Kanji)

Tomás


Re: Multiple Japanese Alphabets in Solr

Posted by François Schiettecatte <fs...@gmail.com>.
Good question about transliteration. The issue has to do with recall: for example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana respectively); not doing the transliteration will miss results. You will find that the big search engines do the transliteration for you automatically. This issue gets even more complicated when you dig into orthographic variation, because Japanese orthography is very variable (i.e. there is more than one way to write a 'word'), as is tokenization (i.e. there is more than one way to tokenize it); see:

	http://www.cjk.org/cjk/reference/japvar.htm
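
For the Toyota example above, one ICU transform is enough to make the two
kana spellings identical; a minimal sketch, assuming ICU4J (the class name
is just for illustration):

    import com.ibm.icu.text.Transliterator;

    public class KanaFoldDemo {
        public static void main(String[] args) {
            // Fold hiragana onto katakana so both spellings become one form.
            Transliterator fold = Transliterator.getInstance("Hiragana-Katakana");
            System.out.println(fold.transliterate("トヨタ")); // トヨタ (unchanged)
            System.out.println(fold.transliterate("とよた")); // トヨタ
        }
    }

A character-level transform like this cannot address the orthographic
variation described on that page, though.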

I have used the Basis Technology software in the past; it is very good, but it is also very expensive.

François



Re: Multiple Japanese Alphabets in Solr

Posted by Walter Underwood <wu...@wunderwood.org>.
Why not index it as-is? Solr can handle Unicode.

Transliterating hiragana to katakana is a very weird idea. I cannot imagine how that would help.

You will need some sort of tokenization to find word boundaries. N-grams work OK for search, but are really ugly for highlighting.

As far as I know, there are no good-quality free tokenizers for Japanese. Basis Technology sells Japanese support that works with Lucene and Solr.

wunder


--
Walter Underwood
Venture ASM, Troop 14, Palo Alto




Re: Multiple Japanese Alphabets in Solr

Posted by François Schiettecatte <fs...@gmail.com>.
Tomás

That won't really work as-is: transliteration to Romaji works on individual terms only, so you would need to tokenize the Japanese prior to transliteration. I am not sure what tool you plan to use for transliteration; I have used ICU in the past, and from what I can tell it does not transliterate Kanji. Besides, transliterating Kanji is debatable for a variety of reasons.
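
A quick way to see the Kanji limitation, assuming ICU4J: the kana transforms
pass Han characters through untouched, and the generic Any-Latin transform,
as far as I can tell, falls back on Mandarin readings rather than Japanese
ones (the class name here is just for illustration):

    import com.ibm.icu.text.Transliterator;

    public class KanjiLimitDemo {
        public static void main(String[] args) {
            Transliterator kana = Transliterator.getInstance("Katakana-Latin");
            Transliterator any  = Transliterator.getInstance("Any-Latin");

            System.out.println(kana.transliterate("豊田")); // 豊田, passed through
            // Any-Latin romanizes Han by Mandarin readings, not "toyota".
            System.out.println(any.transliterate("豊田"));
        }
    }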

What I would suggest is that you transliterate Hiragana to Katakana, leave the Kanji alone, and index/search using n-grams. If you want 'proper' tokenization I would recommend MeCab. A sketch of that suggestion follows.
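
In Lucene terms (3.x-era API), it could look something like this; a rough,
untested sketch, assuming the analysis-icu module is on the classpath, with
plain character bigrams standing in for the n-gram choice:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.icu.ICUTransformFilter;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import com.ibm.icu.text.Transliterator;

    /** Fold Hiragana onto Katakana, leave Kanji alone, and emit character
     *  bigrams; use the same analyzer at index time and query time. */
    public class KanaBigramAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new NGramTokenizer(reader, 2, 2); // bigrams
            stream = new ICUTransformFilter(stream,
                    Transliterator.getInstance("Hiragana-Katakana"));
            return stream;
        }
    }

MeCab would replace the NGramTokenizer with real word boundaries, but it
lives outside Lucene, so you would need to wrap it in your own Tokenizer.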

I have looked into this for a client; there is no clear-cut solution.

Cheers

François

