Posted to solr-user@lucene.apache.org by Eugene <be...@gmail.com> on 2014/07/30 19:47:32 UTC

Implementing custom analyzer for multi-language stemming

    Hello, fellow Solr and Lucene users and developers!

    In our project we receive text from users in different languages. We
detect the language automatically and use the Google Translate APIs a lot
(so having an arbitrary number of languages in our system doesn't concern
us). However, we need to be able to search using stemming. Having nearly a
hundred fields (several fields for each language, with language-specific
stemmers) listed in our search query is not an option, so we need a way to
have a single index which holds stemmed tokens for different languages. I
have two questions:

    1. Are there already (third-party) custom multi-language stemming
analyzers? (I doubt we are the first to run into this issue.)

    2. If I'm going to implement such an analyzer myself, could you please
suggest a good way to 'pass' the detected language value into it? Detecting
the language in the analyzer itself is not an option, because: a) we
already detect it elsewhere; b) we do it based on the combined values of
many fields ('name', 'topic', 'description', etc.), while the current
field can be too short for reliable detection; and c) sometimes we just
want to specify the language explicitly. The obvious hack would be to
prepend the ISO 639-1 code to the field value, but I'd like to believe
that Solr allows for a cleaner solution. I can think of either: a) a
custom query parameter (but I guess that would require modifying request
handlers, etc., which is highly undesirable); or b) getting the value from
another field (we obviously have a 'language' field, and we do not have
mixed-language records). If this is possible, could you please describe
the mechanism for doing it or point me to relevant code examples?
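
    For illustration, here is a minimal plain-Lucene sketch of the end
result I'm after (the class name, field names, and analyzer choices are
illustrative, not working code from our project). It relies on the fact
that a Lucene field built from a pre-analyzed TokenStream bypasses the
IndexWriter's configured analyzer, so a single field can hold tokens
stemmed per language:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;

    public class LanguageAwareIndexer {
        // one analyzer (and thus one stemmer) per detected ISO 639-1 code
        private static final Map<String, Analyzer> ANALYZERS = Map.of(
                "en", new EnglishAnalyzer(),
                "de", new GermanAnalyzer());
        private static final Analyzer FALLBACK = new StandardAnalyzer();

        public static void addDocument(IndexWriter writer, String lang,
                String text) throws IOException {
            Analyzer analyzer = ANALYZERS.getOrDefault(lang, FALLBACK);
            Document doc = new Document();
            doc.add(new StringField("language", lang, Store.YES));
            // a TextField built from a TokenStream bypasses the writer's
            // analyzer, so each document is stemmed in its own language
            // while sharing the single "content" field
            doc.add(new TextField("content",
                    analyzer.tokenStream("content", text)));
            writer.addDocument(doc);
        }
    }
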
Thank you very much and have a good day!

Re: Implementing custom analyzer for multi-language stemming

Posted by roman-v1 <ro...@mail.ru>.
Is there a way to set an attribute on the tokens in the tokenizer so that
a document can be searched by a word together with this attribute?




Re: Implementing custom analyzer for multi-language stemming

Posted by roman-v1 <ro...@mail.ru>.
If each token has a LanguageAttribute on it, then when I search by word
and language with highlighting switched on, every word of the sentence
gets highlighted. Because of this, that solution does not fit.




Re: Implementing custom analyzer for multi-language stemming

Posted by Rich Cariens <ri...@gmail.com>.
Yes, each token could have a LanguageAttribute on it, just like
ScriptAttributes. I didn't *think* a span would be necessary.

I would also add a multivalued "lang" field to the document. Searching
English documents for "die" might look like: "q=die&lang=eng". The "lang"
param could tell the RequestHandler to add a filter query "fq=lang:eng" to
constrain the search to the English corpus, as well as recruit an English
analyzer when tokenizing the "die" query term.
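
A rough SolrJ sketch of the client side of that idea (the "lang" field and
its values are illustrative; the analyzer-selection half would still need
custom code in the RequestHandler):

    import org.apache.solr.client.solrj.SolrQuery;

    public class LangFilterExample {
        static SolrQuery englishDieQuery() {
            // constrain the search to the English corpus; the English
            // analysis of the query term would still happen server-side
            SolrQuery query = new SolrQuery("die");
            query.addFilterQuery("lang:eng");
            return query;
        }
    }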

Since I can't control text length, I would just let the language detection
tool do its best and not sweat it.


On Wed, Aug 6, 2014 at 12:11 AM, TK <ku...@sonic.net> wrote:

>
> On 8/5/14, 8:36 AM, Rich Cariens wrote:
>
>> Of course this is extremely primitive and basic, but I think it would be
>> possible to write a CharFilter or TokenFilter that inspects the entire
>> TokenStream to guess the language(s), perhaps even noting where languages
>> change. Language and position information could be tracked, the
>> TokenStream
>> rewound and then Tokens emitted with "LanguageAttributes" for downstream
>> Token stemmers to deal with.
>>
> I'm curious how you are planning to handle the LanguageAttribute.
> Would each token have this attribute, denoting a span of tokens
> with a language? But then how would you search
> English documents that include the term "die" while skipping
> all the German documents, which are most likely to have "die"?
>
> Automatic language detection works OK for long text with
> regular kinds of content, but it doesn't work well with short
> text. What strategy would you use to deal with short text?
>
> --
> TK
>
>

Re: Implementing custom analyzer for multi-language stemming

Posted by TK <ku...@sonic.net>.
On 8/5/14, 8:36 AM, Rich Cariens wrote:
> Of course this is extremely primitive and basic, but I think it would be
> possible to write a CharFilter or TokenFilter that inspects the entire
> TokenStream to guess the language(s), perhaps even noting where languages
> change. Language and position information could be tracked, the TokenStream
> rewound and then Tokens emitted with "LanguageAttributes" for downstream
> Token stemmers to deal with.
>
I'm curious how you are planning to handle the LanguageAttribute.
Would each token have this attribute, denoting a span of tokens
with a language? But then how would you search
English documents that include the term "die" while skipping
all the German documents, which are most likely to have "die"?

Automatic language detection works OK for long text with
regular kinds of content, but it doesn't work well with short
text. What strategy would you use to deal with short text?

-- 
TK


Re: Implementing custom analyzer for multi-language stemming

Posted by Rich Cariens <ri...@gmail.com>.
I've started a GitHub project to try out some cross-lingual analysis ideas (
https://github.com/whateverdood/cross-lingual-search). I haven't played
over there for about 3 months, but plan on restarting work there shortly.
In a nutshell, the interesting component
("SimplePolyGlotStemmingTokenFilter") relies on ICU4J ScriptAttributes:
each token is inspected for its script, e.g. "latin" or "arabic", and then
a "ScriptStemmer" recruits the appropriate stemmer to handle the token.

Of course this is extremely primitive and basic, but I think it would be
possible to write a CharFilter or TokenFilter that inspects the entire
TokenStream to guess the language(s), perhaps even noting where languages
change. Language and position information could be tracked, the TokenStream
rewound and then Tokens emitted with "LanguageAttributes" for downstream
Token stemmers to deal with.

Or is that a crazy idea?


On Tue, Aug 5, 2014 at 12:10 AM, TK <ku...@sonic.net> wrote:

> On 7/30/14, 10:47 AM, Eugene wrote:
>
>>      Hello, fellow Solr and Lucene users and developers!
>>
>>      In our project we receive text from users in different languages. We
>> detect the language automatically and use the Google Translate APIs a lot
>> (so having an arbitrary number of languages in our system doesn't concern
>> us). However, we need to be able to search using stemming. Having nearly a
>> hundred fields (several fields for each language, with language-specific
>> stemmers) listed in our search query is not an option, so we need a way to
>> have a single index which holds stemmed tokens for different languages.
>>
>
> Do you mean to have a Tokenizer that switches among supported languages
> depending on the "lang" field? This is something I thought about when I
> started working on Solr/Lucene, and I soon realized it is not possible
> because of the way Lucene is designed: the Tokenizer in an analyzer chain
> cannot peek at another field's value, and there is no way to control
> which field is processed first.
>
> If that's not what you are trying to achieve, could you tell us what
> it is? If you have text in different languages in a single field, and
> someone searches for a word common to many languages,
> such as "sports" (or "Lucene" for that matter), Solr will return
> documents in different languages, most of which the user
> doesn't understand. Would that be useful? If you have
> a special use case, would you like to share it?
>
> --
> Kuro
>

Re: Implementing custom analyzer for multi-language stemming

Posted by TK <ku...@sonic.net>.
On 7/30/14, 10:47 AM, Eugene wrote:
>      Hello, fellow Solr and Lucene users and developers!
>
>      In our project we receive text from users in different languages. We
> detect the language automatically and use the Google Translate APIs a lot
> (so having an arbitrary number of languages in our system doesn't concern
> us). However, we need to be able to search using stemming. Having nearly a
> hundred fields (several fields for each language, with language-specific
> stemmers) listed in our search query is not an option, so we need a way to
> have a single index which holds stemmed tokens for different languages.

Do you mean to have a Tokenizer that switches among supported languages
depending on the "lang" field? This is something I thought about when I
started working on Solr/Lucene, and I soon realized it is not possible
because of the way Lucene is designed: the Tokenizer in an analyzer chain
cannot peek at another field's value, and there is no way to control which
field is processed first.
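
(For contrast, what Lucene does support out of the box is choosing the
analyzer per field name, e.g. via PerFieldAnalyzerWrapper, which is
exactly the many-fields setup you want to avoid. A minimal sketch with
made-up field names:)

    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class PerFieldExample {
        // the analyzer is chosen by field name only; it never sees the
        // value of another field such as "lang"
        static Analyzer perFieldAnalyzer() {
            return new PerFieldAnalyzerWrapper(
                    new StandardAnalyzer(),
                    Map.of("name_en", new EnglishAnalyzer(),
                           "name_de", new GermanAnalyzer()));
        }
    }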

If that's not what you are trying to achieve, could you tell us what
it is? If you have text in different languages in a single field, and
someone searches for a word common to many languages,
such as "sports" (or "Lucene" for that matter), Solr will return
documents in different languages, most of which the user
doesn't understand. Would that be useful? If you have
a special use case, would you like to share it?

-- 
Kuro

Re: Implementing custom analyzer for multi-language stemming

Posted by atawfik <co...@gmail.com>.
Hi,

The author of Solr in Action has produced something similar to what you
want. I have even used it for one of my projects, where I needed to
analyze languages automatically. Here is the link to the code:
https://github.com/treygrainger/solr-in-action/tree/master/src/main/java/sia/ch14
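
If I remember the approach correctly, the field type there carries the
language hint as a prefix on the field value itself, roughly in this
shape (illustrative only; check the repository for the exact syntax):

    public class MultiTextPrefixExample {
        // hypothetical value shape: "en,es|actual field text"
        public static void main(String[] args) {
            String raw = "en|the dogs are running";
            int bar = raw.indexOf('|');
            String[] langs = raw.substring(0, bar).split(",");  // ["en"]
            String text = raw.substring(bar + 1);  // text to be analyzed
            System.out.println(java.util.Arrays.toString(langs)
                    + " | " + text);
        }
    }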

Nevertheless, you need to be aware that not all languages are supported by
Lucene or Solr. Therefore, some of the languages detected by the Google
API will not have a corresponding analysis chain; you would need to
develop those yourself.

In another project, I am following the same approach to develop an
AutoAnalyzer for Lucene without using Solr. So let me know if you want
directions on how to do it.

Regards
Ameer


