Posted to solr-user@lucene.apache.org by "Prakashganesh, Prabhu" <Pr...@dowjones.com> on 2012/04/04 14:29:40 UTC

Choosing tokenizer based on language of document

Hi,
      I have documents in different languages, and I want to choose the tokenizer for each document based on its language. The language of the document is already known and is indexed in a field. When I index the text of the document, I want to choose the tokenizer based on the value of that language field. I want to use a single field for the document text (defining one field per language is not an option). It seems I can define a tokenizer for a field, so I guess what I need is a custom tokenizer that looks at the language field value of the document and delegates to the appropriate tokenizer for that language (e.g. StandardTokenizer for English, CJKTokenizer for CJK languages, etc.). From what I have read, writing a custom tokenizer seems quite straightforward, but how would this custom tokenizer know the language of the document? Is there some way I can pass this value to the tokenizer, or some way the tokenizer can access other fields of the document? It would be really helpful if someone could provide an answer.
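
To make the idea concrete, the delegation itself looks simple if the language value were somehow available. A rough sketch against the Lucene 3.x Analyzer API (the class name and the constructor parameter are hypothetical; getting that language value into the analyzer per document is exactly the part I don't see how to do):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical sketch: delegate tokenization based on a language code.
// The open question is how to supply "language" per document in Solr,
// since the analyzer is constructed per field type, not per document.
public class LanguageAwareAnalyzer extends Analyzer {
    private final String language; // e.g. "en", "zh", "ja", "ko"

    public LanguageAwareAnalyzer(String language) {
        this.language = language;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        if ("zh".equals(language) || "ja".equals(language) || "ko".equals(language)) {
            return new CJKTokenizer(reader);                      // bigram tokenization for CJK
        }
        return new StandardTokenizer(Version.LUCENE_35, reader);  // everything else
    }
}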

Thanks
Prabhu

RE: Choosing tokenizer based on language of document

Posted by "Prakashganesh, Prabhu" <Pr...@dowjones.com>.
Hi Dominique, Erick,
       Thanks for replying. At a high level, I am trying to work out the pros and cons of different approaches to handling multilingual content. From what I have read on the web, the most common/recommended approach seems to be to split/shard by language, so that each shard/index holds content in a single language and you can define a language-specific char filter, analyzer, tokenizer etc. for each shard. The big issue I have with this approach is that the shards/indexes holding the most frequently searched content (e.g. English) would be doing most of the work, so load would not be evenly distributed across the servers. Another suggestion is to have one field per language, but that makes searching across languages difficult and slower, and seems quite cumbersome. So I am trying to figure out the downsides of having one field for all languages, with all content evenly distributed across all shards/indexes, and to see whether we can address the negatives in a reasonable way.

As you both say, working out the correct language needs to be done on both the index side and the query side. On the index side, the language of the document is known. On the query side, users would specify a list of language filters to search on (e.g. la=(en OR fr OR zhcn etc.), one or more), and we can also try to detect the query language from the character set of the query terms.

I do not really need language-specific tokenization for non-CJK+ content. I have a set of tokenization/normalization rules that I can easily define in a custom char filter, and the Solr standard tokenizer should work reasonably well for non-CJK+ content (English/French/German etc., about 20-odd languages). The standard tokenizer tokenizes CJK+ content as unigrams, though, which would not be good enough; we would need a proper tokenizer for CJK+ (Chinese, Japanese, Korean, Thai and a few other Asian languages). Hence my question about choosing the tokenizer based on the language of the content. If the document is in one of the non-CJK+ languages, I would choose the chain of custom char filter plus standard tokenizer; if it is in one of the CJK+ languages, I would choose the Solr CJKTokenizer or an appropriate chain. On the query side, if the query contains terms in a CJK+ character set, or an explicit language filter for one of the CJK+ languages, I would choose the CJK+ analyzer chain, otherwise the custom standard chain. In future I could make query language detection more sophisticated using dictionaries etc.

From what Dominique says, there seems to be no easy way for the analyzer chain to know the value of the language field; his approach is to include the language at the beginning of the stream on both the index side and the query side. Are there any other ways to do this?
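
To illustrate the character-set detection mentioned above, this is roughly the check I have in mind (plain JDK, just a sketch; a real implementation would need to cover more scripts and decide what to do with mixed-script queries):

// Sketch: pick the CJK+ analysis chain at query time when the query
// string contains CJK or Thai characters.
public final class QueryScriptSniffer {
    public static boolean containsCjkPlus(String query) {
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            Character.UnicodeBlock b = Character.UnicodeBlock.of(cp);
            if (b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || b == Character.UnicodeBlock.HIRAGANA
                    || b == Character.UnicodeBlock.KATAKANA
                    || b == Character.UnicodeBlock.HANGUL_SYLLABLES
                    || b == Character.UnicodeBlock.THAI) {
                return true;
            }
            i += Character.charCount(cp); // advance by code point, not char
        }
        return false;
    }
}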
One thing I haven't mentioned is stemming. Stemming is defined as part of the analyzer chain, but there doesn't seem to be an easy way to turn stemming on/off from the query side. If stemming is included in the analyzer chain at index time, only the stemmed form is in the index, and to get a match you have to use the same stemmer on the query side and get the query language right. If I want to control stemming on the query side (turn it on/off), the only way seems to be to have two fields, one defined with a stemmer in the chain and one without; that duplicates data and increases index size. A better way would be to index, in a separate index structure, only the stemmed words that differ from the original: search the regular index structure when stemming is off on the query side, and search both when stemming is on. When stemming is turned on for a query you would need to get the query language right, but users would typically turn on stemming when searching a specific language, so getting the query language would not be difficult. Any thoughts/ideas on turning stemming on/off?
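
One Lucene building block that seems close to the "index the stem only when it differs" idea is KeywordRepeatFilter combined with RemoveDuplicatesTokenFilter: every token is emitted twice, the keyword-aware stemmer skips the copy marked as keyword, and the duplicate is dropped when stemming changed nothing, so the original and (where different) the stemmed form end up at the same position. A sketch of such a chain (Lucene 3.x API, untested on my side, treat it as an illustration):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Sketch: index both the original and the stemmed form at the same
// position, skipping the extra posting when stemming changes nothing.
public class StemPlusOriginalAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_35, reader);
        ts = new KeywordRepeatFilter(ts);         // emit each token twice, one copy marked keyword
        ts = new PorterStemFilter(ts);            // keyword-aware: stems only the unmarked copy
        ts = new RemoveDuplicatesTokenFilter(ts); // drop the duplicate when stem == original
        return ts;
    }
}

With both forms in the index, a query side that does not stem matches exact forms only, and a query side that stems its terms gets the stemmed matches, without a second field.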

Thanks
Prabhu



Re: Choosing tokenizer based on language of document

Posted by Dominique Bejean <do...@eolya.fr>.
Hi,

Yes, I agree it is not an easy issue. Indexing all languages with the
appropriate char filter, tokenizer and filters for each language is not
possible without developing a new field type and a new analyzer.

If you plan to index up to 10 different languages, I suggest one text
field per language or one index per language.

One field for all languages can be interesting if you plan to index a
lot of different languages in the same index. In that case, having one
field per language (text_en, text_fr, ...) gets complicated if you want
the user to be able to retrieve documents in any language with a single
query. The query becomes complex if you have 50 different languages
(text_en:... OR text_fr:... OR ...).

To achieve this you will need to develop a specific analyzer. This
analyzer is in charge of using the correct char filter, tokenizer and
filters for the language of the document. You will need a configurable
analyzer in order to change language-specific settings (enable stemming
or not, choose a specific stopwords file, ...).

I did this several years ago for Solr 1.4.1, and it still works with
Solr 3.x. The drawback of that analyzer is that all language settings
are hard coded (tokenizer, filters, stopwords, ...). With Solr 4.0 the
analyzer no longer works, so I decided to redevelop it so that all
language settings can be configured in an external configuration file,
with nothing hard coded.

I had to develop not only the analyzer but also a field type.

The main issue is that the analyzer is not aware of the values of the
other fields, so it is not possible to use another field to specify the
content language. The only way I found is to start the content with a
specific character sequence: [en]... or [fr]...
The analyzer needs to know the language of the query too, so query
criteria on the multilingual field also have to include the specific
character sequence: [en]...
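
To illustrate the idea (this is not my actual code, just a simplified
sketch; the fixed 4-character "[xx]" tag and the two-letter codes are
assumptions of the sketch):

import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

// Sketch of the "[en]..." trick: peel a leading language tag off the
// character stream, then delegate to a per-language analyzer.
public class TaggedContentAnalyzer extends Analyzer {
    private final Map<String, Analyzer> byLanguage;
    private final Analyzer fallback;

    public TaggedContentAnalyzer(Map<String, Analyzer> byLanguage, Analyzer fallback) {
        this.byLanguage = byLanguage;
        this.fallback = fallback;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        try {
            PushbackReader in = new PushbackReader(reader, 4);
            char[] tag = new char[4];          // e.g. "[en]"
            int n = in.read(tag);              // a real implementation would loop until 4 chars are read
            String lang = null;
            if (n == 4 && tag[0] == '[' && tag[3] == ']') {
                lang = new String(tag, 1, 2);  // "en", "fr", ...
            } else if (n > 0) {
                in.unread(tag, 0, n);          // no tag found: push everything back
            }
            Analyzer delegate = lang != null && byLanguage.containsKey(lang)
                    ? byLanguage.get(lang) : fallback;
            return delegate.tokenStream(fieldName, in);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

The same analyzer then works at query time, since the query criteria
carry the same tag.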

If you are interested in this work, let me know.

If someone knows another way to provide the content language to the
analyzer at index time, or the query language at query time, I am
interested :).

Regards.

Dominique


Re: Choosing tokenizer based on language of document

Posted by Erick Erickson <er...@gmail.com>.
This is really difficult to imagine working well. Even if you
do choose the appropriate analysis chain (and it must
be a chain here) and manage to tokenize appropriately
for each language, what happens at query time?

How do you expect to get matches on, say, Ukrainian when
the tokens of the query are in Erse?

This feels like an XY problem; can you explain at a
higher level what your requirements are?

Best
Erick