Posted to solr-user@lucene.apache.org by "Van Tassell, Kristian" <kr...@siemens.com> on 2013/03/21 15:01:38 UTC

What to expect when testing Japanese search index

I’m trying to set up our search index to handle Japanese data, and while some searches yield results, others do not. The shorter the search term, the more often this happens.

For example, searching for this term: 更

Yields no results even though I know it appears in the text. I understand that this character alone may not be a full word without further context, and thus, perhaps it should not return a hit(?).

What about putting a star after it? 更*

Should that return hits? I had been using the text_ja boilerplate setup, but I wonder whether a bigram setup (text_cjk) might work better while I am testing as a non-Japanese speaker. Thanks in advance for any insight!


Re: What to expect when testing Japanese search index

Posted by Hayden Muhl <ha...@gmail.com>.
A search for a single character will only return hits if that character
makes up a whole word, and only if the tokenizer recognizes that character
as a word. It's just like in other languages, where a search for "p" won't
return documents with the word "apple".
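
To make the point above concrete, here is a toy sketch of term matching
against an inverted index. It is not Solr or Kuromoji code: the vocabulary,
the greedy segmenter, and every function name are made up for illustration,
and real analyzers are far more sophisticated. It only shows why a
single-character term misses when the indexed tokens are longer, and why a
prefix query such as 更* can still match tokens that start with that
character.

```python
from collections import defaultdict

# Hypothetical mini-vocabulary standing in for a morphological dictionary.
VOCAB = {"更新", "情報", "変更"}

def word_tokens(text, vocabulary=VOCAB):
    """Greedy longest-match segmentation (a stand-in for a real tokenizer)."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when nothing longer matches.
            if candidate in vocabulary or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

def bigram_tokens(text):
    """Overlapping two-character tokens, roughly the idea behind bigram indexing."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def term_query(index, term):
    """Exact term lookup: the query must equal an indexed token."""
    return set(index.get(term, set()))

def prefix_query(index, prefix):
    """Prefix lookup: matches any indexed token starting with the prefix."""
    hits = set()
    for token, doc_ids in index.items():
        if token.startswith(prefix):
            hits |= doc_ids
    return hits

docs = {1: "更新情報", 2: "変更"}
word_index = build_index(docs, word_tokens)
bigram_index = build_index(docs, bigram_tokens)

print(term_query(word_index, "更"))     # exact single-character lookup
print(prefix_query(word_index, "更"))   # prefix lookup, like 更*
print(term_query(bigram_index, "更新")) # two-character term on the bigram index
```

In this sketch the lone character matches nothing in either index, because
no indexed token is exactly 更, while the prefix lookup finds only the
document whose token begins with it. Whether your real index behaves the
same way depends on the analyzer configuration.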

If I were you, I would go into the Solr admin UI and start playing around
with the analysis tool. You can paste a phrase in there and it will show
you what tokens that phrase will be broken into. I think that will give you
a better understanding of why you are getting these search results.
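
If you would rather script this than click through the UI, the same
analysis is usually reachable over HTTP through Solr's field analysis
request handler. The sketch below only builds the request URL. The base URL
and core name are assumptions, and it assumes the /analysis/field handler
is enabled in your solrconfig.xml, so adjust it to your installation.

```python
from urllib.parse import urlencode

def analysis_url(base_url, field_type, index_text, query_text=None):
    """Build a URL for Solr's field analysis handler (path assumed enabled)."""
    params = {
        "analysis.fieldtype": field_type,   # e.g. text_ja or text_cjk
        "analysis.fieldvalue": index_text,  # text analyzed as at index time
        "wt": "json",
    }
    if query_text is not None:
        params["analysis.query"] = query_text  # text analyzed as at query time
    return base_url.rstrip("/") + "/analysis/field?" + urlencode(params)

# Hypothetical host and core name; replace with your own.
url = analysis_url("http://localhost:8983/solr/collection1",
                   "text_ja", "更新情報", query_text="更")
print(url)
```

Fetching that URL should return the token stream produced at each stage of
the index and query analyzers, which you can compare between text_ja and
text_cjk.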

You also don't mention which version of Solr you are using. Can you also
include the definition of your text_ja field type?

- Hayden

