Posted to solr-user@lucene.apache.org by revathy arun <re...@gmail.com> on 2009/02/16 06:42:07 UTC

Multilanguage

Hi,
I have a scenario where I need to convert PDF content to text and then
index it at run time. I do not know in advance what language the PDF will
be in. In that case, what is the best option for the content field type in
the schema that the extracted text would be indexed into?

That is, can I use the default tokenizer for all languages? Since I would
not know the language, I would not be able to stem the tokens; how would
this impact search? Is there any other solution?

Rgds

Re: Multilanguage

Posted by Erick Erickson <er...@gmail.com>.
I recommend that you search both this and the
Lucene list. You'll find that this topic has been
discussed many times, and several approaches
have been outlined.

The searchable archives are linked to from here:
http://lucene.apache.org/java/docs/mailinglists.html.

Best
Erick


Re: Multilanguage

Posted by Karl Wettin <ka...@gmail.com>.
On 17 Feb 2009, at 21:26, Grant Ingersoll wrote:

> I believe Karl Wettin submitted a Lucene patch for a Language  
> guesser: http://issues.apache.org/jira/browse/LUCENE-826 but it is  
> marked as won't fix.

The test case of LUCENE-1039 is a language classifier. I've used the patch
to detect the language of user queries (where I know the text is rather
simple to classify as a specific language).


      karl

Re: Multilanguage

Posted by Walter Underwood <wu...@netflix.com>.
On 2/17/09 12:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:

> If purchasing, several companies offer solutions, but I don't know
> that their quality is any better than what you can get through open
> source, as generally speaking, the problem is solved with a high
> degree of accuracy through n-gram analysis.

The expensive part of the problem is getting a good corpus in each
language, tuning the classifier, and QA. The commercial ones usually
recognize encoding and language, which is more complicated. Sorting
out the ISO-2022 codes is a real mess, for example.

Pre-Unicode PDF files are also a horror. To do it right, you need
to recognize which fonts are Central European, and so on.
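
For what it's worth, a minimal sketch of combined charset-and-language
sniffing with the open-source ICU4J CharsetDetector (purely an illustration
of the idea, not how the commercial tools work; the class and method around
it are made up):

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    public class EncodingAndLanguageSniffer {
        // Guess the charset (and sometimes the language) of raw bytes
        // extracted from a document before handing the text to the indexer.
        public static void sniff(byte[] rawBytes) {
            CharsetDetector detector = new CharsetDetector();
            detector.setText(rawBytes);
            CharsetMatch match = detector.detect();   // best single guess
            if (match != null) {
                System.out.println("charset:    " + match.getName());
                System.out.println("language:   " + match.getLanguage());   // may be null
                System.out.println("confidence: " + match.getConfidence()); // 0..100
            }
        }
    }

ICU's detector covers the ISO-2022 family among others, but as noted above,
getting this genuinely right across encodings and corpora is the expensive
part.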

wunder


Re: Multilanguage

Posted by Grant Ingersoll <gs...@apache.org>.
There are a number of freeware options here; just do some searching with
your favorite Internet search engine.

TextCat is one of the more popular ones, as I seem to recall:
http://odur.let.rug.nl/~vannoord/TextCat/

I believe Karl Wettin submitted a Lucene patch for a language guesser:
http://issues.apache.org/jira/browse/LUCENE-826, but it is marked as
won't fix.

Nutch has a Language Identification plugin as well (see the links below);
it probably isn't too hard to extract the source from it for your needs.

Also see http://www.lucidimagination.com/search/?q=multilingual+detection
and http://www.lucidimagination.com/search/?q=language+detection for help.

If purchasing, several companies offer solutions, but I don't know  
that their quality is any better than what you can get through open  
source, as generally speaking, the problem is solved with a high  
degree of accuracy through n-gram analysis.
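
To make the n-gram idea concrete, here is a rough Java sketch of a
character-trigram classifier (hypothetical code, not TextCat or LUCENE-826;
all class and method names are invented): build a trigram frequency profile
per language from sample text, then pick the language whose profile best
overlaps the input.

    import java.util.HashMap;
    import java.util.Map;

    public class NGramLanguageGuesser {

        // One trigram-frequency profile per language label, e.g. "en", "fr".
        private final Map<String, Map<String, Integer>> profiles =
                new HashMap<String, Map<String, Integer>>();

        // Learn a profile from sample text whose language is known.
        public void train(String language, String sampleText) {
            profiles.put(language, trigramCounts(sampleText));
        }

        // Return the trained language whose profile overlaps the input most.
        public String guess(String text) {
            Map<String, Integer> input = trigramCounts(text);
            String best = null;
            double bestScore = -1.0;
            for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
                double score = overlap(input, e.getValue());
                if (score > bestScore) {
                    bestScore = score;
                    best = e.getKey();
                }
            }
            return best;
        }

        // Count character trigrams in lowercased, whitespace-normalized text.
        private static Map<String, Integer> trigramCounts(String text) {
            String s = text.toLowerCase().replaceAll("\\s+", " ");
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (int i = 0; i + 3 <= s.length(); i++) {
                String gram = s.substring(i, i + 3);
                Integer c = counts.get(gram);
                counts.put(gram, c == null ? 1 : c + 1);
            }
            return counts;
        }

        // Fraction of the input's trigram occurrences also seen in the profile.
        private static double overlap(Map<String, Integer> input,
                                      Map<String, Integer> profile) {
            long shared = 0, total = 0;
            for (Map.Entry<String, Integer> e : input.entrySet()) {
                total += e.getValue();
                if (profile.containsKey(e.getKey())) {
                    shared += e.getValue();
                }
            }
            return total == 0 ? 0.0 : (double) shared / total;
        }
    }

TextCat itself ranks the most frequent n-grams and compares rank order (the
"out of place" distance) rather than the crude overlap used above, but the
principle is the same.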

-Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Multilanguage

Posted by revathy arun <re...@gmail.com>.
Hi Otis,

But this is not freeware, right?





Re: Multilanguage

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

No, Tika doesn't do LangID. I haven't used ngramj, so I can't speak to its accuracy or speed (but I know the code has been around for years). Another LangID implementation is at the URL below my name.

Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 





Re: Multilanguage

Posted by revathy arun <re...@gmail.com>.
Does Apache Tika help find the language of the given document?




Re: Multilanguage

Posted by Till Kinstler <ki...@gbv.de>.
Paul Libbrecht wrote:

> Clearly, then, something that matches words in a dictionary and decides
> on the language based on the language of the majority could do a decent
> job of deciding which analyzer to use.
> 
> Does such a tool exist?

I once played around with http://ngramj.sourceforge.net/ for language
guessing. It did a good job. It doesn't use dictionaries for language
identification but rather a statistical approach based on n-grams.
I don't have precise numbers, but out of about 10,000 documents in
different languages (most in English, German and French, a few in other
European languages such as Polish), only about 10 were not identified
correctly.

Till

-- 
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinstler@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de

Re: Multilanguage

Posted by Paul Libbrecht <pa...@activemath.org>.
I was looking for such a tool and haven't found one yet.
Using StandardAnalyzer one can obtain some form of token stream which
can be used for "language-agnostic" analysis.
Clearly, then, something that matches words in a dictionary and decides
on the language based on the language of the majority could do a decent
job of deciding which analyzer to use.

Does such a tool exist?
It doesn't seem too hard to do with Lucene.
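
A rough sketch of that dictionary-vote idea (hypothetical code; the tiny
word lists and all names are made up, and the naive split() stands in for
StandardAnalyzer's token stream):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class DictionaryVoteGuesser {

        // A few very common words per language; a real tool would load much
        // larger dictionaries.
        private final Map<String, Set<String>> commonWords =
                new HashMap<String, Set<String>>();

        public DictionaryVoteGuesser() {
            commonWords.put("en", new HashSet<String>(Arrays.asList(
                    "the", "and", "of", "to", "in", "is", "that")));
            commonWords.put("de", new HashSet<String>(Arrays.asList(
                    "der", "die", "das", "und", "ist", "nicht", "ein")));
            commonWords.put("fr", new HashSet<String>(Arrays.asList(
                    "le", "la", "les", "et", "est", "que", "une")));
        }

        // Majority vote: the language whose list matches the most tokens wins.
        public String guess(String text) {
            Map<String, Integer> votes = new HashMap<String, Integer>();
            for (String token : text.toLowerCase().split("\\W+")) {
                for (Map.Entry<String, Set<String>> lang : commonWords.entrySet()) {
                    if (lang.getValue().contains(token)) {
                        Integer current = votes.get(lang.getKey());
                        votes.put(lang.getKey(), current == null ? 1 : current + 1);
                    }
                }
            }
            String best = null;
            int bestVotes = 0;
            for (Map.Entry<String, Integer> e : votes.entrySet()) {
                if (e.getValue() > bestVotes) {
                    bestVotes = e.getValue();
                    best = e.getKey();
                }
            }
            return best; // null if nothing matched
        }
    }

This works reasonably well on running text but degrades on very short
strings, which is where the n-gram approach mentioned elsewhere in this
thread tends to do better.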

paul


On 17 February 2009 at 04:44, Otis Gospodnetic wrote:

> The best option would be to identify the language after parsing the  
> PDF and then index it using an appropriate analyzer defined in  
> schema.xml.


Re: Multilanguage

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

The best option would be to identify the language after parsing the PDF and then index it using an appropriate analyzer defined in schema.xml.
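
As a purely hypothetical illustration (field and type names invented here,
not taken from any shipped example schema), schema.xml could define one
analyzed field type per language, and the indexing code would route the
extracted text into the field matching the detected language:

    <fieldType name="text_en" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_fr" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="French"/>
      </analyzer>
    </fieldType>

    <field name="content_en" type="text_en" indexed="true" stored="true"/>
    <field name="content_fr" type="text_fr" indexed="true" stored="true"/>

Queries can then target the field for the query's (known or detected)
language, or a catch-all copyField with a less language-specific analyzer
can serve as a fallback.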

Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 



