Posted to java-user@lucene.apache.org by Kevin Burton <bu...@gmail.com> on 2005/08/19 23:42:26 UTC

NGram Language Categorization Source

Hey lucene guys.

I know for a fact that a bunch of you have been curious about language
categorization for a long time now and Java has lacked a solid way to
solve this problem.

Anyway.  This new library that I just released should be easy to tie
into your Lucene indexers.  Just run the library on a text (strip the
HTML first), create a new field in Lucene called LANG (or something
similar), and then filter your searches on JUST that language code.
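
Roughly, the wiring looks like this (the categorize() call below is
just a stand-in for ngramcat's actual entry point - check the library
for the real API):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class LangFieldExample {

    // Index time: detect the language of the HTML-stripped text and
    // store it as an untokenized, indexed keyword field.
    static Document makeDoc(String plainText) {
        Document doc = new Document();
        doc.add(Field.Text("contents", plainText));
        doc.add(Field.Keyword("LANG", categorize(plainText))); // e.g. "en"
        return doc;
    }

    // Search time: wrap the user's query in a filter so only hits in
    // the requested language come back.
    static Hits searchInLang(IndexSearcher searcher, Query query,
                             String langCode) throws Exception {
        QueryFilter langFilter =
            new QueryFilter(new TermQuery(new Term("LANG", langCode)));
        return searcher.search(query, langFilter);
    }

    // Stand-in for ngramcat's real entry point.
    static String categorize(String text) {
        return "en";
    }
}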

I'd love some help with filling out missing languages if anyone has
some spare time.  That would help make up for all the hard work I've
done here (nudge.. nudge)

I did a full survey of the language categorization space for Java and
I think this is basically the only library out there.

Good luck
...

I'm working on a blog post describing how blog search engines like
Technorati, PubSub, and Feedster could/should use language
categorization to help deal with the chaos of tagging and full-text
search. Google has done this for a long time now and Technorati has it
in beta.

http://www.feedblog.org/2005/08/ngram_language_.html

-- 
 Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412



Re: NGram Language Categorization Source

Posted by Andrzej Bialecki <ab...@getopt.org>.
Kevin Burton wrote:

>>A lot depends on the reference profiles (which in turn depend on the
>>quality of your training corpus - in this case, your corpus is not the
>>best choice, because each text contains a lot of foreign words).
> 
> 
> I realize that my corpus isn't the best.  That's one of the reasons
> I've open-sourced it.  The main improvement in ngramcat (my code) is
> that if the result isn't obvious we throw an Exception, so
> theoretically we won't see any false positives unless the language
> categorization is WAY off.

That's also how other implementations do it - you need to set an 
arbitrary threshold, and if all the profiles score below that threshold 
then an "unknown" value is returned (or null, or an Exception is thrown).

> 
> 
>>It was
>>also found that the way you create ngram profiles (e.g. with or without
>>surrounding spaces, single length or mixed length) affects the LI
>>performance. 
> 
> 
> LI???
> 

LI = Language Identification. Sorry for the confusion.

> I haven't benchmarked it but I'd be interested in any suggestions you have.
> 
> 
>>For documents with mixed languages it was also found that
>>methods which combine ngrams with stopwords work better.
> 
> 
> Hm.. interesting.. where?  URL I can read?

Someone mentioned the Linguini paper, where they found that using "short 
words" as features gives performance comparable to using ngrams.

See also the following papers:

http://www.xrce.xerox.com/Publications/Attachments/1995-012/Gref---Comparing-two-language-identification-schemes.pdf
http://citeseer.ist.psu.edu/40861.html
http://www.xs4all.nl/~ajwp/langident.pdf

In general, using stop words works only for texts above a certain 
minimum length (greater than with n-gram methods), but then gives 
nearly 100% accuracy.
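
To illustrate the stopword method (the word lists below are tiny
samples - real lists would have a few hundred entries per language):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StopwordGuess {
    static final Map<String, Set<String>> STOPWORDS =
        new HashMap<String, Set<String>>();
    static {
        STOPWORDS.put("en", new HashSet<String>(
            Arrays.asList("the", "and", "of", "to", "in")));
        STOPWORDS.put("de", new HashSet<String>(
            Arrays.asList("der", "die", "und", "ein", "nicht")));
    }

    // Count stopword hits per language and pick the language with the
    // most hits; short texts simply don't contain enough stopwords,
    // which is why this method needs a minimum input length.
    static String guess(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> e : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String w : words) {
                if (e.getValue().contains(w)) hits++;
            }
            if (hits > bestHits) { bestHits = hits; best = e.getKey(); }
        }
        return best;
    }
}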


>>So, there is still a lot to do in this area, if you come up with some
>>unique way of improving LI performance...
> 
> 
> Maybe I'm being dense but what is LI performance?

Language Identification performance - in the sense that a given 
identifier "performs" better if it can correctly identify more 
languages, using shorter input text.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: NGram Language Categorization Source

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

Sounds like that LI acronym was confusing - Language Identification.

Otis


> > It was also found that the way you create ngram profiles (e.g.
> > with or without surrounding spaces, single length or mixed length)
> > affects the LI performance.
> 
> LI???
> 
> I haven't benchmarked it but I'd be interested in any suggestions you
> have.
> 
> > So, there is still a lot to do in this area, if you come up with
> > some unique way of improving LI performance...
> 
> Maybe I'm being dense but what is LI performance?




Re: NGram Language Categorization Source

Posted by Kevin Burton <bu...@gmail.com>.
> Erhm... Not to rain on your parade, but Googling for "ngram java" gives
> a lot of hits. http://sourceforge.net/projects/ngramj and also
> "languageidentifier" in Nutch are two examples of Open Source Java
> implementations. Each can be used with Lucene.

I think I've played with ngramj and found it very lacking. 
 
I haven't played with 'languageidentifier' in Nutch ... 

> A lot depends on the reference profiles (which in turn depend on the
> quality of your training corpus - in this case, your corpus is not the
> best choice, because each text contains a lot of foreign words).

I realize that my corpus isn't the best.  That's one of the reasons
I've open-sourced it.  The main improvement in ngramcat (my code) is
that if the result isn't obvious we throw an Exception, so
theoretically we won't see any false positives unless the language
categorization is WAY off.

> It was
> also found that the way you create ngram profiles (e.g. with or without
> surrounding spaces, single length or mixed length) affects the LI
> performance. 

LI???

I haven't benchmarked it but I'd be interested in any suggestions you have.

> For documents with mixed languages it was also found that
> methods which combine ngrams with stopwords work better.

Hm.. interesting.. where?  URL I can read?
 
> Additionally, simple methods based on cosine similarity (or delta
> ranking) don't give correct results for documents with mixed languages.
> In such cases input texts are chunked, and each chunk is analyzed
> separately, and then the scores are combined... etc, etc... millions of
> ways you can do this - and of course no method is perfect. :-)

Yes.  We don't handle the mixed language case very well.  The chunking
method is something I've wanted to explore.

> So, there is still a lot to do in this area, if you come up with some
> unique way of improving LI performance...

Maybe I'm being dense but what is LI performance?

Thanks.

Kevin

-- 
 Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412



Re: NGram Language Categorization Source

Posted by Andrzej Bialecki <ab...@getopt.org>.
Kevin Burton wrote:
> Hey lucene guys.
> 
> I know for a fact that a bunch of you have been curious about language
> categorization for a long time now and Java has lacked a solid way to
> solve this problem.
> 
> Anyway.  This new library that I just released should be easy to tie
> into your Lucene indexers.  Just run the library on a text (strip the
> HTML first), create a new field in Lucene called LANG (or something
> similar), and then filter your searches on JUST that language code.
> 
> I'd love some help with filling out missing languages if anyone has
> some spare time.  That would help make up for all the hard work I've
> done here (nudge.. nudge)
> 
> I did a full survey of the language categorization space for Java
> and I think this is basically the only library out there.

Erhm... Not to rain on your parade, but Googling for "ngram java" gives 
a lot of hits. http://sourceforge.net/projects/ngramj and also 
"languageidentifier" in Nutch are two examples of Open Source Java 
implementations. Each can be used with Lucene.

A lot depends on the reference profiles (which in turn depend on the 
quality of your training corpus - in this case, your corpus is not the 
best choice, because each text contains a lot of foreign words). It was 
also found that the way you create ngram profiles (e.g. with or without 
surrounding spaces, single length or mixed length) affects the LI 
performance. For documents with mixed languages it was also found that 
methods which combine ngrams with stopwords work better.
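
For example, the "surrounding spaces, mixed length" variant of profile
building looks roughly like this (a bare sketch; real implementations
keep only the top few hundred ngrams, ranked by frequency):

import java.util.HashMap;
import java.util.Map;

public class NgramProfile {
    // Pad the text with spaces so word boundaries produce their own
    // ngrams, then count every ngram of length 1 to 3.
    static Map<String, Integer> build(String text) {
        String padded =
            " " + text.toLowerCase().replaceAll("\\s+", " ") + " ";
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                String gram = padded.substring(i, i + n);
                Integer c = counts.get(gram);
                counts.put(gram, c == null ? 1 : c + 1);
            }
        }
        return counts;
    }
}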

Additionally, simple methods based on cosine similarity (or delta 
ranking) don't give correct results for documents with mixed languages. 
In such cases input texts are chunked, and each chunk is analyzed 
separately, and then the scores are combined... etc, etc... millions of 
ways you can do this - and of course no method is perfect. :-)
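
A sketch of the chunking idea (the chunk size and the identify() call
are placeholders):

import java.util.ArrayList;
import java.util.List;

public class ChunkedIdentify {
    // Identify each fixed-size chunk separately and collect the
    // per-chunk results; a real system would combine the underlying
    // scores instead of just collecting labels.
    static List<String> identifyChunks(String text, int chunkSize) {
        List<String> langs = new ArrayList<String>();
        for (int i = 0; i < text.length(); i += chunkSize) {
            int end = Math.min(i + chunkSize, text.length());
            langs.add(identify(text.substring(i, end)));
        }
        return langs;
    }

    // Stand-in for whatever single-language identifier is in use.
    static String identify(String chunk) {
        return "unknown";
    }
}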

So, there is still a lot to do in this area, if you come up with some 
unique way of improving LI performance...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: NGram Language Categorization Source

Posted by Kevin Burton <bu...@gmail.com>.
> * A Nutch implementation:
> http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/
> 
> * A Lucene patch: http://issues.apache.org/bugzilla/show_bug.cgi?id=26763

A step in the right direction. It doesn't come with profiles for other
languages, though.

> * JTextCat (http://www.jedi.be/JTextCat/index.html),  a Java wrapper
> for libtextcat

Yes. I saw JTextCat.. I didn't want any JNI involved.

> * NGramJ (http://ngramj.sourceforge.net/), a general n-gram Java library

LGPL.. yuk. That said, I think I reviewed this package and found it
lacking.  I started off just trying to find a library to use in our
crawler but never found anything, which is why I ended up writing my
own.

> Of these, the Nutch one is certainly under active development; the
> others don't seem to be, as far as I can tell.

They should just use ngramcat :)

Kevin

-- 
 Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412



Re: NGram Language Categorization Source

Posted by Tom White <to...@gmail.com>.
Hi Kevin,

On 8/19/05, Kevin Burton <bu...@gmail.com> wrote:
> Hey lucene guys.
> 
> I know for a fact that a bunch of you have been curious about language
> categorization for a long time now and Java has lacked a solid way to
> solve this problem.
> 
> Anyway.  This new library that I just released should be easy to tie
> into your Lucene indexers.  Just run the library on a text (strip the
> HTML first), create a new field in Lucene called LANG (or something
> similar), and then filter your searches on JUST that language code.
> 
> I'd love some help with filling out missing languages if anyone has
> some spare time.  That would help make up for all the hard work I've
> done here (nudge.. nudge)
> 
> I did a full survey of the language categorization space for Java
> and I think this is basically the only library out there.

I know of the following existing Java implementations of language
categorization:

* A Nutch implementation:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/

* A Lucene patch: http://issues.apache.org/bugzilla/show_bug.cgi?id=26763

* JTextCat (http://www.jedi.be/JTextCat/index.html),  a Java wrapper
for libtextcat

* NGramJ (http://ngramj.sourceforge.net/), a general n-gram Java library

Of these, the Nutch one is certainly under active development; the
others don't seem to be, as far as I can tell.

> 
> Good luck
> ...
> 
> I'm working on a blog post describing how blog search engines like
> Technorati, PubSub, and Feedster could/should use language
> categorization to help deal with the chaos of tagging and full-text
> search. Google has done this for a long time now and Technorati has it
> in beta.
> 
> http://www.feedblog.org/2005/08/ngram_language_.html
> 

I like your idea of using Wikipedia translations as the training
corpus - it's a good way to get fairly reliable sources for lots of
languages.

Regards,

Tom



Re: NGram Language Categorization Source

Posted by Kevin Burton <bu...@gmail.com>.
> > ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf
> >
> >     Linguini: Language Identification for Multilingual Documents
> >     John M. Prager
> 
> Prager also uses an n-gram approach, so you might be able to take
> advantage of some of his research into optimal values for <n>.

Yeah.. though to be honest, as long as you're on the long-tail
portion of N, the values won't matter much, I think.

All you'll do is waste a bit of memory (like 1k)
 
> The code to Linguini doesn't seem to be available (you have to
> purchase some IBM product(s) to get it) so what you've done is great
> for the open source community - thanks!
> 
> Also I could post to the Unicode list re training data in multiple
> languages, as that's a good place to find out about multilingual
> corpora.

Yeah. That was my biggest problem. This area had never really been
solved in the OSS world.

-- 
 Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412



Re: NGram Language Categorization Source

Posted by Ken Krugler <kk...@transpac.com>.
Hi Kevin,

>I know for a fact that a bunch of you have been curious about language
>categorization for a long time now and Java has lacked a solid way to
>solve this problem.
>
>Anyway.  This new library that I just released should be easy to tie
>into your Lucene indexers.  Just run the library on a text (strip the
>HTML first), create a new field in Lucene called LANG (or something
>similar), and then filter your searches on JUST that language code.
>
>I'd love some help with filling out missing languages if anyone has
>some spare time.  That would help make up for all the hard work I've
>done here (nudge.. nudge)
>
>I did a full survey of the language categorization space for Java
>and I think this is basically the only library out there.

[snip]

Recently I'd posted the following to the Nutch mailing list, since 
the topic of determining web page languages had come up there as well:

>Given the recent discussion regarding charset/language detection on 
>this list, people might find this IBM research paper interesting:
>
>ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf
>
>     Linguini: Language Identification for Multilingual Documents
>     John M. Prager

Prager also uses an n-gram approach, so you might be able to take 
advantage of some of his research into optimal values for <n>.

The code to Linguini doesn't seem to be available (you have to 
purchase some IBM product(s) to get it) so what you've done is great 
for the open source community - thanks!

Also I could post to the Unicode list re training data in multiple 
languages, as that's a good place to find out about multilingual 
corpora.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
