Posted to solr-user@lucene.apache.org by revathy arun <re...@gmail.com> on 2009/02/16 06:42:07 UTC
Multilanguage
Hi,
I have a scenario where I need to convert PDF content to text and then
index it at run time. I do not know in advance what language the PDF will
be in. In this case, what is the best solution with respect to the content
field type in the schema that the text content would be indexed into?
That is, can I use the default tokenizer for all languages? Since I would
not know the language, and hence would not be able to stem the tokens,
how would this impact search? Is there any other solution?
Rgds
Re: Multilanguage
Posted by Erick Erickson <er...@gmail.com>.
I recommend that you search both this and the
Lucene list. You'll find that this topic has been
discussed many times, and several approaches
have been outlined.
The searchable archives are linked to from here:
http://lucene.apache.org/java/docs/mailinglists.html.
Best
Erick
On Mon, Feb 16, 2009 at 12:42 AM, revathy arun <re...@gmail.com> wrote:
> Hi,
> I have a scenario where I need to convert PDF content to text and then
> index it at run time. I do not know in advance what language the PDF will
> be in. In this case, what is the best solution with respect to the content
> field type in the schema that the text content would be indexed into?
>
> That is, can I use the default tokenizer for all languages? Since I would
> not know the language, and hence would not be able to stem the tokens,
> how would this impact search? Is there any other solution?
>
> Rgds
>
Re: Multilanguage
Posted by Karl Wettin <ka...@gmail.com>.
On 17 Feb 2009, at 21:26, Grant Ingersoll wrote:
> I believe Karl Wettin submitted a Lucene patch for a Language
> guesser: http://issues.apache.org/jira/browse/LUCENE-826 but it is
> marked as won't fix.
The test case of LUCENE-1039 is a language classifier. I've used that
patch to detect the language of user queries (where I know the text is
rather simple to classify as a specific language).
karl
Re: Multilanguage
Posted by Walter Underwood <wu...@netflix.com>.
On 2/17/09 12:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> If purchasing, several companies offer solutions, but I don't know
> that their quality is any better than what you can get through open
> source, as generally speaking, the problem is solved with a high
> degree of accuracy through n-gram analysis.
The expensive part of the problem is getting a good corpus in each
language, tuning the classifier, and QA. The commercial ones usually
recognize encoding and language, which is more complicated. Sorting
out the ISO-2022 codes is a real mess, for example.
Pre-Unicode PDF files are also a horror. To do it right, you need
to recognize which fonts are Central European, and so on.
wunder
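As a small illustration of why the ISO-2022 family is messy: the encoding switches character sets in-band with escape sequences, so even recognizing "this stream is ISO-2022-JP" means scanning for designation sequences. A minimal sketch (class and method names are mine, and a real detector would have to track every designation sequence and the current state, not just these two):

```java
import java.nio.charset.StandardCharsets;

// Minimal sniff for ISO-2022-JP: ESC $ @ and ESC $ B designate JIS X 0208,
// and ESC ( B switches back to ASCII. This only looks for the JIS X 0208
// designations; a real detector must handle the full set of sequences.
public class Iso2022Sniffer {

    public static boolean looksLikeIso2022Jp(byte[] bytes) {
        for (int i = 0; i + 2 < bytes.length; i++) {
            if (bytes[i] == 0x1B && bytes[i + 1] == '$'
                    && (bytes[i + 2] == '@' || bytes[i + 2] == 'B')) {
                return true; // found a JIS X 0208 designation sequence
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] jp = {0x1B, '$', 'B', 0x30, 0x42, 0x1B, '(', 'B'};
        byte[] ascii = "plain ASCII text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeIso2022Jp(jp));    // true
        System.out.println(looksLikeIso2022Jp(ascii)); // false
    }
}
```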
Re: Multilanguage
Posted by Grant Ingersoll <gs...@apache.org>.
There are a number of freeware options here; just do some searching on
your favorite Internet search engine.
TextCat is one of the more popular, as I seem to recall: http://odur.let.rug.nl/~vannoord/TextCat/
I believe Karl Wettin submitted a Lucene patch for a Language guesser: http://issues.apache.org/jira/browse/LUCENE-826
but it is marked as won't fix.
Nutch has a Language Identification plugin as well; it probably isn't too
hard to extract the source from it for your needs.
Also see http://www.lucidimagination.com/search/?q=multilingual+detection
and http://www.lucidimagination.com/search/?q=language+detection for help.
If purchasing, several companies offer solutions, but I don't know
that their quality is any better than what you can get through open
source, as generally speaking, the problem is solved with a high
degree of accuracy through n-gram analysis.
-Grant
On Feb 17, 2009, at 11:57 AM, revathy arun wrote:
> Hi Otis,
>
> But this is not freeware, right?
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
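For anyone curious what the n-gram analysis Grant mentions looks like in practice, here is a toy character-trigram guesser in the same spirit as TextCat and ngramj. Everything here is illustrative: the class name, the simple overlap score, and the one-line training snippets are mine. TextCat itself ranks candidates by an "out-of-place" distance between frequency-ordered profiles, which is more robust than this overlap score.

```java
import java.util.HashMap;
import java.util.Map;

// Toy character-trigram language guesser. Train it with one sample text
// per language, then pick the language whose trigram profile overlaps
// most with the input. Real systems use larger corpora and better scoring.
public class NGramLangGuesser {
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Count character trigrams, padding with spaces to catch word edges.
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = " " + text.toLowerCase() + " ";
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public void train(String lang, String sample) {
        profiles.put(lang, trigrams(sample));
    }

    // Score each language by frequency-weighted shared trigrams; highest wins.
    public String guess(String text) {
        Map<String, Integer> doc = trigrams(text);
        String best = null;
        long bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> lang : profiles.entrySet()) {
            long score = 0;
            for (Map.Entry<String, Integer> g : doc.entrySet()) {
                Integer c = lang.getValue().get(g.getKey());
                if (c != null) score += (long) c * g.getValue();
            }
            if (score > bestScore) {
                bestScore = score;
                best = lang.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLangGuesser g = new NGramLangGuesser();
        g.train("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        g.train("de", "der schnelle braune fuchs springt und die katze sitzt auf der matte");
        System.out.println(g.guess("the cat and the dog"));
    }
}
```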
Re: Multilanguage
Posted by revathy arun <re...@gmail.com>.
Hi Otis,
But this is not freeware, right?
On 2/17/09, Otis Gospodnetic <ot...@yahoo.com> wrote:
>
> Hi,
>
> No, Tika doesn't do LangID. I haven't used ngramj, so I can't speak for
> its accuracy nor speed (but I know the code has been around for
> years). Another LangID implementation is at the URL below my name.
>
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Multilanguage
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,
No, Tika doesn't do LangID. I haven't used ngramj, so I can't speak for its accuracy or speed (but I know the code has been around for years). Another LangID implementation is at the URL below my name.
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
________________________________
From: revathy arun <re...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage
Does Apache Tika help find the language of the given document?
Re: Multilanguage
Posted by revathy arun <re...@gmail.com>.
Does Apache Tika help find the language of the given document?
Re: Multilanguage
Posted by Till Kinstler <ki...@gbv.de>.
Paul Libbrecht wrote:
> Clearly, then, something that matches words in a dictionary and decides
> on the language based on the language of the majority could do a decent
> job to decide the analyzer.
>
> Does such a tool exist?
I once played around with http://ngramj.sourceforge.net/ for language
guessing. It did a good job. It doesn't use dictionaries for language
identification, but rather a statistical approach using n-grams.
I don't have precise numbers, but out of about 10,000 documents in
different languages (most in English, German, and French, a few in other
European languages like Polish), only some 10 were not identified
correctly.
Till
--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinstler@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
Re: Multilanguage
Posted by Paul Libbrecht <pa...@activemath.org>.
I was looking for such a tool and haven't found it yet.
Using StandardAnalyzer one can obtain some form of token-stream which
can be used for "agnostic analysis".
Clearly, then, something that matches words against a dictionary and
decides on the language based on the language of the majority could do
a decent job of deciding the analyzer.
Does such a tool exist?
It doesn't seem too hard for Lucene.
paul
On 17 Feb 2009, at 04:44, Otis Gospodnetic wrote:
> The best option would be to identify the language after parsing the
> PDF and then index it using an appropriate analyzer defined in
> schema.xml.
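Paul's dictionary/majority-vote idea can be sketched in a few lines. The word lists below are tiny illustrative stand-ins for real per-language dictionaries, and the class name is mine:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Majority vote over per-language stopword lists: each token that appears
// in a language's list is one vote for that language; most votes wins.
// The lists here are deliberately tiny; real ones would be much larger.
public class DictionaryVote {
    static final Map<String, Set<String>> DICTS = Map.of(
            "en", Set.of("the", "and", "of", "to", "is", "in"),
            "fr", Set.of("le", "la", "et", "les", "des", "est"),
            "de", Set.of("der", "die", "das", "und", "ist", "ein"));

    public static String guess(String text) {
        Map<String, Integer> votes = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            for (Map.Entry<String, Set<String>> d : DICTS.entrySet()) {
                if (d.getValue().contains(token)) {
                    votes.merge(d.getKey(), 1, Integer::sum);
                }
            }
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("unknown");
    }

    public static void main(String[] args) {
        System.out.println(guess("The cat and the dog is in the house")); // en
        System.out.println(guess("Der Hund und die Katze ist ein Tier")); // de
    }
}
```

Stopwords work well for a vote like this because they are frequent in any running text and rarely shared across languages; the weakness is very short or stopword-free input, where an n-gram approach degrades more gracefully.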
Re: Multilanguage
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,
The best option would be to identify the language after parsing the PDF and then index it using an appropriate analyzer defined in schema.xml.
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
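One common way to act on this suggestion is to define one content field per language in schema.xml (e.g. content_en wired to an English analyzer, content_fr to a French one) and route each document's extracted text at index time. The field-naming convention below is an assumption for illustration, not a Solr default:

```java
import java.util.Set;

// Map a detected language code to a per-language Solr field, falling back
// to a catch-all "content" field for unsupported or undetected languages.
// Assumes schema.xml defines content_en / content_fr / content_de, each
// with a language-appropriate analyzer chain.
public class LanguageFieldRouter {
    static final Set<String> SUPPORTED = Set.of("en", "fr", "de");

    public static String fieldFor(String detectedLang) {
        return SUPPORTED.contains(detectedLang)
                ? "content_" + detectedLang
                : "content";
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("fr")); // content_fr
        System.out.println(fieldFor("pl")); // content
    }
}
```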
________________________________
From: revathy arun <re...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Monday, February 16, 2009 1:42:07 PM
Subject: Multilanguage
Hi,
I have a scenario where I need to convert PDF content to text and then
index it at run time. I do not know in advance what language the PDF will
be in. In this case, what is the best solution with respect to the content
field type in the schema that the text content would be indexed into?
That is, can I use the default tokenizer for all languages? Since I would
not know the language, and hence would not be able to stem the tokens,
how would this impact search? Is there any other solution?
Rgds