Posted to java-user@lucene.apache.org by Antony Bowesman <ad...@teamware.com> on 2006/10/13 09:42:57 UTC

Analyzers and multiple languages

Hello,

I'm new to Lucene and wanted some advice on analyzers, stemmers and language 
analysis.  I've got LIA, so have read its chapters.

I am writing a framework that needs to be able to index documents from a range 
of languages where only the character set of the document is known.  Has anyone 
looked at, or is anyone using, language analysis to determine the language of a 
document in ISO-8859-1?

Is it worth doing, or does StandardAnalyzer cope well with most European 
languages as long as it is provided with a suitable multilingual set of stop words?

What about stemming?  I see Google now says it does stemming, but here again 
language detection seems to be a stumbling block in the way of choosing the 
right stemmer.  Does stemming provide much of an index size reduction, and is it 
actually useful in search?

Antony


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analyzers and multiple languages

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Oct 13, 2006, at 3:42 AM, Antony Bowesman wrote:
> I am writing a framework that needs to be able to index documents  
> from a range of languages where just the character set of the  
> document is known.  Has anyone looked at or is using language  
> analysis to determine the language of a document in ISO-8859-1.

There is a language identifier plugin in the Nutch codebase that 
could surely be distilled (and there are plans to do so) into a 
standalone library:

	<http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/languageidentifier/>


> What about stemming?  I see Google now says it does stemming, but  
> again here language detection seems to be a stumbling block in the  
> way of choosing the right stemmer.  Does stemming provide much of  
> an index size reduction and is it actually useful in search?

Stemming shouldn't be considered a way to reduce index size, but rather 
a way to improve a user's experience of findability.  It is quite useful in 
the right situations, but it is not something that all projects 
want, so you'd have to see whether it fits your needs specifically.

	Erik





Re: Analyzers and multiple languages

Posted by Erick Erickson <er...@gmail.com>.
This won't be *really* helpful, but I remember this being discussed at some
length a while ago. You'd find some good info if you searched the
list archive, probably for "language".

I didn't pay much attention, since this isn't something I've been concerned with
lately, so I can't be much real help...

Best
Erick


Re: Analyzers and multiple languages

Posted by Mark Miller <ma...@gmail.com>.
Generally, stemming is not a method for index size reduction, even though
that might be a side effect. It is very useful in search, however: you would
generally want a search for skiing to also hit ski and skier (I can't spell,
so don't get caught up on that). There are lots of examples like that. If you
are doing general search, stemming is great, if not quite as great as
lemmatization. Look at the Snowball stemmers in contrib. The stemming king
wrote them, I believe.
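To make the skiing/skier/ski point concrete, here is a toy suffix stripper in plain Java. It is not the Snowball code (use the contrib stemmers for real work), just a sketch of how conflating related word forms onto one index term works; the suffix rules are illustrative only:

```java
// Toy suffix stripper -- illustrates conflation only; real projects
// should use the Snowball stemmers shipped in Lucene's contrib area.
public class ToyStemmer {

    // Strip a few common English suffixes so related forms share one term.
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("ies") && w.length() > 4) return w.substring(0, w.length() - 3) + "y";
        if (w.endsWith("er")  && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("s")   && w.length() > 3 && !w.endsWith("ss")) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        // "skiing", "skier" and "ski" all reduce to the same index term.
        System.out.println(stem("skiing")); // ski
        System.out.println(stem("skier"));  // ski
        System.out.println(stem("ski"));    // ski
    }
}
```

A query for any of the three forms then matches documents containing any of the others, because both the indexed tokens and the query tokens pass through the same stemmer.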


Language recognition can be a pain in the ass. Do some google searching and
check out this:
http://en.wikipedia.org/wiki/Language_recognition_chart

- Mark


Re: Analyzers and multiple languages

Posted by Soeren Pekrul <so...@gmx.de>.
Hello Antony,

I have a similar problem. My collection contains mainly German 
documents, but some in English and a few in French, Spanish and Latin. I 
know that each language has its own stemming rules.

Language detection is not my domain, but I can imagine it would be 
possible to detect the language of a document with statistical methods such 
as character-based n-grams: "ä", "ö", "ü" and "ß" are quite common in 
German words, "th" could indicate English, and so on. It is probably more 
complex than that. Matching a language's stop words in a document could be 
another, or an additional, way. However, let's say I can detect the language 
of a document. Then I would use an analyzer or stemmer for the language 
of the document.

Now I see two other problems. Quite often you will find English 
terms in non-English documents, and you will apply the wrong analyzer 
to those terms. The other problem is the query: you should use the same 
analyzer for indexing the documents and for parsing the queries, but the 
query is usually too short for statistical methods, and stop words occur 
in queries less often.

So for my task I decided to use one analyzer for all documents and 
queries. I use the stemmer of the most probable language of my 
documents. That is not perfect, but it should be OK.

Sören




Re: Analyzers and multiple languages (language detection)

Posted by Bob Carpenter <ca...@alias-i.com>.

Language ID is pretty easy.  The best way to
do it wholly within Lucene would be with a
separate index containing one document per
language, with an analyzer that returns weighted
character n-grams.  You can read about our analyzer
for doing that in LIA.  This is what some
of the packages, such as Gertjan van Noord's, do.
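A stripped-down version of the character-n-gram idea fits in one plain-Java class. This is a sketch only, not the Lucene/LIA analyzer or the LingPipe classifier: each language gets a trigram-count profile, and the document goes to the profile with the largest overlap. The sample strings used as "training data" here are tiny and purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal character-trigram profile matcher: one profile per language,
// nearest profile wins. Real systems train profiles on large corpora;
// the samples in main() are toy data for illustration only.
public class NgramLangId {

    // Count character n-grams of length n in the text.
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String s = " " + text.toLowerCase() + " "; // mark word boundaries
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Score = shared n-gram occurrences between document and profile.
    static int overlap(Map<String, Integer> doc, Map<String, Integer> lang) {
        int score = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            score += Math.min(e.getValue(), lang.getOrDefault(e.getKey(), 0));
        }
        return score;
    }

    public static String identify(String text, Map<String, String> samples) {
        Map<String, Integer> doc = profile(text, 3);
        String best = "unknown";
        int bestScore = -1;
        for (Map.Entry<String, String> e : samples.entrySet()) {
            int score = overlap(doc, profile(e.getValue(), 3));
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> samples = Map.of(
            "en", "the quick brown fox jumps over the lazy dog and the cat",
            "de", "über den schnellen braunen fuchs springt der faule hund und die katze");
        System.out.println(identify("the dog and the fox", samples));    // en
        System.out.println(identify("der hund und die katze", samples)); // de
    }
}
```

The same profiles could equally live as per-language documents in a small Lucene index, as suggested above, with the trigrams emitted as tokens.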

If you need very high accuracy, you could also
use our language ID, which is based on a probabilistic
classifier.  You can check out our tutorial at:

http://www.alias-i.com/lingpipe/demos/tutorial/langid/read-me.html

Accuracy depends on the pair of languages (some are
more confusable than others), as well as the length of
the input (it's very hard with only one or two words,
especially if it's a name).

- Bob Carpenter
   Alias-i
