Posted to solr-user@lucene.apache.org by bing <JS...@hotmail.com> on 2012/03/13 04:25:50 UTC

Can solr-langid(Solr3.5.0) detect multiple languages in one text?

Hi, all, 

I am using solr-langid (Solr 3.5.0) for language detection, and I would like
it to detect multiple languages within one text. 

The example text is: 
咖哩起源於印度。印度民間傳說咖哩是佛祖釋迦牟尼所創,由於咖哩的辛辣與香味可以幫助遮掩羊肉的腥騷,此舉即為用以幫助不吃豬肉與牛肉的印度人。在泰米爾語中,「kari」是「醬」的意思。在馬來西亞,kari也稱dal(當在mamak檔)。早期印度被蒙古人所建立的莫臥兒帝國(Mughal
Empire)所統治過,其間從波斯(現今的伊朗)帶來的飲食習慣,從而影響印度人的烹調風格直到現今。
Curry (plural, Curries) is a generic term primarily employed in Western
culture to denote a wide variety of dishes originating in Indian, Pakistani,
Bangladeshi, Sri Lankan, Thai or other Southeast Asian cuisines. Their
common feature is the incorporation of more or less complex combinations of
spices and herbs, usually (but not invariably) including fresh or dried hot
capsicum peppers, commonly called "chili" or "cayenne" peppers.

I would like the text to be separated into two parts, with the Chinese part
going to "text_zh-tw" and the other to "text_en". Can I do something like
that? 
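For reference, my current setup follows the single-language mapping described on the wiki, roughly like this (field and chain names are illustrative, not my exact configuration):

```xml
<!-- solrconfig.xml: update chain using the LangDetect-based identifier -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text</str>              <!-- field(s) to run detection on -->
    <str name="langid.langField">language</str>   <!-- where the detected code is stored -->
    <bool name="langid.map">true</bool>           <!-- rename text to text_en, text_zh-tw, ... -->
    <str name="langid.fallback">en</str>          <!-- used when detection is inconclusive -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

This maps the whole field to a single language-suffixed field, which is exactly the limitation: a mixed document like the one above ends up entirely in either text_en or text_zh-tw.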

Thank you. 

Best Regards, 
Bing 


--
View this message in context: http://lucene.472066.n3.nabble.com/Can-solr-langid-Solr3-5-0-detect-multiple-languages-in-one-text-tp3821210p3821210.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can solr-langid(Solr3.5.0) detect multiple languages in one text?

Posted by bing <JS...@hotmail.com>.
Hi, Jan Høydahl, 

Forgot to mention: the identifier I use is the existing one bundled with
Solr 3.5.0, LangDetectLanguageIdentifier
(http://wiki.apache.org/solr/LanguageDetection). 

For the language identifier, I looked into the source code and found that the
whole content of a text is parsed before detection, which is why the end
result is a single language instead of multiple languages. I would assume
that if the content were processed section by section (or even line by
line), the end result would consist of multiple languages. So the question
is: could you plug such a modification of the existing identifier into
Solr? 


Best Regards, 
Bing 

--
View this message in context: http://lucene.472066.n3.nabble.com/Can-solr-langid-Solr3-5-0-detect-multiple-languages-in-one-text-tp3821210p3821764.html

Re: Can solr-langid(Solr3.5.0) detect multiple languages in one text?

Posted by bing <JS...@hotmail.com>.
Hi, Tanguy, 



>For the other implementation (
>http://code.google.com/p/language-detection/ ), it seems to perform a
>first pass on the input and try to separate Latin characters from the
>others. If there are more non-Latin characters than Latin ones, it will
>process only the non-Latin characters for language detection.
>Oddly, the reverse does not hold: non-Latin characters are not stripped
>from the input when there are more Latin characters than non-Latin ones...

The example case is simplified, but it reflects the normal conditions I need
to handle: typically the task is to detect non-Latin languages, and mostly
to separate Western and Eastern languages. 

>Anyway, LangDetect's implementation ends up with a list of
>probabilities, and only the most probable one is kept by Solr's
>langdetect processor, provided that probability satisfies a certain
>threshold. 

Yes, I agree with you on "a list of probabilities", and I think that if all
of those probabilities were returned, my problem would be partially solved. 

>In this very particular case, something simple based on Unicode ranges
>could be used to provide hints on how to chunk the input, because here
>we need to split Western and Eastern languages, which are written in
>well-isolated Unicode character ranges.
>Using this, the language identifier could be fed with chunks that are
>(presumably) mostly made of one language only, and we could get a
>different language identification for each distinct chunk. 

Intelligent chunk partitioning might be a separate and substantial task in
itself. Would it be possible to process the text line by line (or several
lines at a time)? If the detected language changes between two consecutive
lines (or groups of lines), that would indicate the start of a different
language range.
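To make the idea concrete, here is a rough Python sketch of that line-by-line scheme (illustration only, not Solr code; guess_lang is a trivial script-counting stand-in for the real identifier, and all names are mine):

```python
def guess_lang(line):
    """Stand-in for a real identifier: label a line by its dominant script."""
    cjk = sum(1 for ch in line if 0x4E00 <= ord(ch) <= 0x9FFF)
    latin = sum(1 for ch in line if ch.isascii() and ch.isalpha())
    return "zh-tw" if cjk > latin else "en"

def ranges_by_language(text):
    """Detect each line and merge consecutive lines with the same language."""
    ranges = []
    for line in text.splitlines():
        if not line.strip():
            continue
        lang = guess_lang(line)
        if ranges and ranges[-1][0] == lang:
            ranges[-1][1].append(line)      # same language: extend current range
        else:
            ranges.append((lang, [line]))   # language changed: start a new range
    return [(lang, "\n".join(lines)) for lang, lines in ranges]
```

On the curry example above, this would yield one "zh-tw" range followed by one "en" range, each of which could then go to its own field.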


Thank you for the thoughtful comments.  

Best Regards, 
Bing 

--
View this message in context: http://lucene.472066.n3.nabble.com/Can-solr-langid-Solr3-5-0-detect-multiple-languages-in-one-text-tp3821210p3824365.html

Re: Can solr-langid(Solr3.5.0) detect multiple languages in one text?

Posted by Tanguy Moal <ta...@gmail.com>.
Hi all,

I think that depending on the language detector implementation, things may 
vary...
Tika performs better with longer inputs than shorter ones, as it seems to 
depend on the probabilistic distribution of n-grams (of different sizes) 
to compute distances against precomputed language models.
From what I've understood, shortening the input could therefore confuse 
the detector.
Nevertheless, feeding the language identifier with text known to be 
written in many languages will certainly decrease the confidence of its 
predictions. If the text is half English and half Chinese, the detector 
may not even be able to give a prediction above the certainty threshold.
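To illustrate the point about n-gram profiles, here is a generic Cavnar & Trenkle-style sketch in Python (not Tika's actual code; names and the rank cutoff are mine):

```python
from collections import Counter

def trigram_profile(text, top=300):
    """Rank the most frequent character trigrams of a text."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Distance between profiles: sum of rank differences.

    Smaller means the document looks more like the language model;
    trigrams missing from the model pay the maximum penalty."""
    return sum(abs(r - lang_profile.get(g, max_penalty))
               for g, r in doc_profile.items())
```

With a very short input the document profile holds only a handful of trigrams, so the distances to every language model become noisy, which is why shortening the input can confuse the detector.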

For the other implementation ( 
http://code.google.com/p/language-detection/ ), it seems to perform a 
first pass on the input and try to separate Latin characters from the 
others. If there are more non-Latin characters than Latin ones, it will 
process only the non-Latin characters for language detection.
Oddly, the reverse does not hold: non-Latin characters are not stripped 
from the input when there are more Latin characters than non-Latin ones...

Anyway, LangDetect's implementation ends up with a list of 
probabilities, and only the most probable one is kept by Solr's 
langdetect processor, provided that probability satisfies a certain threshold.
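That selection step boils down to something like this (a schematic in Python, not the processor's actual code; the default values are illustrative):

```python
def pick_language(probabilities, threshold=0.5, fallback="en"):
    """probabilities: (language, probability) pairs as returned by the detector.

    Keep the single most probable language only if it clears the threshold,
    otherwise fall back to a configured default."""
    if not probabilities:
        return fallback
    lang, prob = max(probabilities, key=lambda lp: lp[1])
    return lang if prob >= threshold else fallback
```

Returning the full list instead of only the winner would already partially address the multi-language use case discussed in this thread.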

The tricky part here is chunking the input into an arbitrary number of 
chunks: this is potentially expensive and complicated, so we need to 
find a good candidate partition of the input.

In this very particular case, something simple based on Unicode ranges 
could be used to provide hints on how to chunk the input, because here 
we need to split Western and Eastern languages, which are written in 
well-isolated Unicode character ranges.
Using this, the language identifier could be fed with chunks that are 
(presumably) mostly made of one language only, and we could get a 
different language identification for each distinct chunk.
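A minimal sketch of such range-based chunking (Python, purely illustrative; real code would consult proper Unicode block tables rather than the two hard-coded ranges here):

```python
def split_by_script(text):
    """Split text into alternating runs of CJK and non-CJK characters,
    so each run can be fed to the language identifier separately."""
    def is_cjk(ch):
        # CJK Unified Ideographs plus common CJK punctuation only (a toy subset).
        return 0x4E00 <= ord(ch) <= 0x9FFF or 0x3000 <= ord(ch) <= 0x303F

    chunks = []
    for ch in text:
        script = "cjk" if is_cjk(ch) else "other"
        if chunks and chunks[-1][0] == script:
            chunks[-1][1].append(ch)        # same script: extend current run
        else:
            chunks.append((script, [ch]))   # script changed: start a new run
    return [(s, "".join(cs)) for s, cs in chunks]
```

Each chunk could then be passed to the identifier independently; chunks too short to classify reliably could be merged with a neighbour.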

The hard part remains for languages sharing a large number of characters, 
I guess. It's hard to say "here are the French parts and there are the 
Italian parts" based on Unicode character ranges only.
That's even more complicated when the input text is badly accented, a 
phenomenon that occurs quite frequently, but that's another thread :)

I don't know if that helps; I was just reading the thread mentioned 
yesterday when this message about language detection arrived on the 
list...

Kind regards,

--
Tanguy

On 13/03/2012 09:55, Jan Høydahl wrote:
> Hi,
>
> Language detection cannot do that as of now; it would be a great improvement, though. Language detectors are pluggable, so perhaps if you know of a Java language detector which can do this, we could plug it in. Or we could extend the current identifier with the capability of first splitting the text into chunks and then doing langid on each chunk. If you'd like to open a JIRA issue for this, it will not be forgotten...
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 13. mars 2012, at 04:25, bing wrote:
>
>> [...]


Re: Can solr-langid(Solr3.5.0) detect multiple languages in one text?

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

Language detection cannot do that as of now; it would be a great improvement, though. Language detectors are pluggable, so perhaps if you know of a Java language detector which can do this, we could plug it in. Or we could extend the current identifier with the capability of first splitting the text into chunks and then doing langid on each chunk. If you'd like to open a JIRA issue for this, it will not be forgotten...

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13. mars 2012, at 04:25, bing wrote:

> [...]