Posted to solr-user@lucene.apache.org by vibhoreng04 <vi...@gmail.com> on 2013/10/18 20:01:46 UTC
Issues with Language detection in Solr
Hi All,

I am trying to detect the language of the business name field and the
address field. I am using Solr's LangDetect (the Google library), not Tika. It
works fine in most cases, but in some it detects the wrong language.

For example, for this document:

"OrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
"StreetLine1": "19 GREENWOOD AVE",
"StreetLine2": "",
"SOrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
"StandardizedStreetLine1": "19 GREENWOOD AVE",
"language_s": [ "de" ]

the language is detected as German (de), which is wrong.

Below is my configuration:

OrgName,StreetLine1,StreetLine2,SOrgName,StandardizedStreetLine1
language_s 0.9 en

Why is the language detection wrong here? Please help!

Vibhor
--
View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Issues with Language detection in Solr
Posted by Jack Krupansky <ja...@basetechnology.com>.
Sorry, but Latin is not on the list of supported languages:
https://code.google.com/p/language-detection/wiki/LanguageList
-- Jack Krupansky
Re: Issues with Language detection in Solr
Posted by vibhoreng04 <vi...@gmail.com>.
I agree with you, Jack. But please note that this filter otherwise works
perfectly fine. Only in this one case, even though all the words are Latin,
is the language detected as German. My question is: why and how?
If it works perfectly for the other docs, what in this case is making it
behave abnormally?
Re: Issues with Language detection in Solr
Posted by Jack Krupansky <ja...@basetechnology.com>.
I would say that in general you need at least 15 or 20 words in a text field
for language to be detected reasonably well. Sure, sometimes it can work for
8 to 12 words, but it is a coin flip how reliable it will be.
You haven't shown us any true text fields. I would say that language
detection against simple name fields is a misuse of the language detection
feature. I mean, it is designed for larger blocks of text, not very short
phrases.
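The word-count rule of thumb above can be sketched as a pre-check that runs before language detection. This is a minimal Python illustration, not Solr code: the function names and the MIN_WORDS value of 15 are assumptions chosen for the example, and the sample document is the one from the original question.

```python
import re

# Hypothetical threshold, taken from the "at least 15 or 20 words" rule of thumb.
MIN_WORDS = 15


def count_words(doc, fields):
    """Count alphabetic word tokens across the given fields (digits are ignored)."""
    total = 0
    for name in fields:
        value = doc.get(name) or ""
        # [^\W\d_]+ matches runs of letters only, so "19" is not counted.
        total += len(re.findall(r"[^\W\d_]+", value))
    return total


def enough_text_for_detection(doc, fields, min_words=MIN_WORDS):
    """Return True only if the fields hold at least min_words word tokens."""
    return count_words(doc, fields) >= min_words


# The document from the original question.
doc = {
    "OrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
    "StreetLine1": "19 GREENWOOD AVE",
    "StreetLine2": "",
    "SOrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
    "StandardizedStreetLine1": "19 GREENWOOD AVE",
}
fields = [
    "OrgName",
    "StreetLine1",
    "StreetLine2",
    "SOrgName",
    "StandardizedStreetLine1",
]

print(count_words(doc, fields))                # prints 10
print(enough_text_for_detection(doc, fields))  # prints False
```

With only 10 word tokens, half of them repeats, this document falls below the suggested minimum, so one option would be to skip detection for such documents and rely on a fallback language instead.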
See some examples in my e-book.
-- Jack Krupansky