Posted to dev@tika.apache.org by Jérôme Charron <je...@gmail.com> on 2011/10/24 15:18:57 UTC

Google's Compact Language Detector

Hi,

I just found this blog post from Mike McCandless about Google's Compact
Language Detection code used in Chrome:
http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html

There are probably some interesting things to explore in the Google code in
order to improve Tika's language detection.
Has anyone already taken a look at the Google CLD code?
http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/

Best regards

Jérôme

-- 
@jcharron
http://motre.ch/
http://jcharron.posterous.com/
http://www.shopreflex.fr/
http://www.staragora.com/


Re: Google's Compact Language Detector

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Jerome,

Nice to hear from you my friend! 

I haven't taken a look at Mike's blog post or the LD code, but
it sounds interesting and worth a look. I'll check it out!

Cheers,
Chris

On Oct 24, 2011, at 6:18 AM, Jérôme Charron wrote:

> Hi,
> 
> I just found this blog post from Mike McCandless about Google's Compact
> Language Detection code used in Chrome:
> http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
> 
> There are probably some interesting things to explore in the Google code in
> order to improve Tika's language detection.
> Has anyone already taken a look at the Google CLD code?
> http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
> 
> Best regards
> 
> Jérôme
> 
> -- 
> @jcharron
> http://motre.ch/
> http://jcharron.posterous.com/
> http://www.shopreflex.fr/
> http://www.staragora.com/
> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Google's Compact Language Detector

Posted by Jérôme Charron <je...@gmail.com>.
Thanks Mike for sharing these tests.
There is clearly a performance issue regarding Tika's run time.
As you noticed, it will be interesting to see if the accuracy can be
increased by mixing the language profiles of many libraries.
But I'm not sure the accuracy depends only on the language profiles
and not on the algorithm too...
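
As a thought experiment, here is a minimal Java sketch of what "mixing
the language profiles of many libraries" could mean: normalize each
library's n-gram counts to frequencies and combine them with weights.
All names here are hypothetical; neither Tika nor language-detection
ships a merge like this, and the Map<String, Long> form is just a
stand-in for each library's real profile format.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ProfileMixer {
        /**
         * Merge trigram counts for one language, gathered from several
         * libraries, into a single weighted frequency profile.
         */
        public static Map<String, Double> mix(List<Map<String, Long>> profiles,
                                              double[] weights) {
            Map<String, Double> merged = new HashMap<>();
            for (int i = 0; i < profiles.size(); i++) {
                long total = 0;
                for (long c : profiles.get(i).values()) total += c;
                if (total == 0) continue;
                for (Map.Entry<String, Long> e : profiles.get(i).entrySet()) {
                    // Normalize to relative frequency first, so sources
                    // trained on corpora of different sizes are comparable.
                    double freq = weights[i] * e.getValue() / (double) total;
                    merged.merge(e.getKey(), freq, Double::sum);
                }
            }
            return merged;
        }
    }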


On Tue, Oct 25, 2011 at 18:12, Michael McCandless <lucene@mikemccandless.com> wrote:

> OK I posted the 3rd post about CLD, this time testing perf by
> comparing to Tika and language-detection (Google Code project):
>
>
> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>
> Net/net all three do very well (>= 97% accuracy); I had to remove 4
> languages from consideration because we don't support them.
>
> Tika seems to have a lot of trouble with Spanish (confuses w/
> Galician) and Danish (confuses with Dutch).
>
> Also, Tika's performance is substantially slower than the other two... not
> sure what's up.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
> > On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
> > <kk...@transpac.com> wrote:
> >
> >> Sounds like a great idea - see the recent comment thread on https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
> >>
> >> And there's also https://issues.apache.org/jira/browse/TIKA-539
> >
> > Those do look related (if you swap charset in for language)!
> >
> > It's tricky to know just how much to "trust" what the server
> > (Content-Type HTTP header) and content (http-equiv meta tag) says,
> > though I do like CLD's approach: they never fully "trust" what was
> > declared but rather use the declaration as a hint to boost language
> > priors.
> >
> > And then to figure out what priors to assign for each hint they have
> > these tables trained from a large content set (10% of Base).
> >
> > If we have access to a biggish crawl we could presumably do something
> > similar, ie record how often the hint is wrong and translate that into
> > appropriate prior boosts, ie make it a hint instead of fully trusting
> > it.
> >
> > Does anyone know how ICU translates the encoding "hint" into priors
> > for each encoding?
> >
> >> Also, what will you be using to test language detection? Wikipedia pages?
> >
> > I'm using the corpus from here:
> >
> >    http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
> >
> > It's a random subset of europarl (1000 strings from each of 21 langs).
> >
> > Wikipedia would be great too!
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
>



-- 
--------
@jcharron <http://www.twitter.com/jcharron>
http://motre.ch/
http://jcharron.posterous.com/
http://www.shopreflex.fr/
http://www.staragora.com/


Re: Google's Compact Language Detector

Posted by reinhard schwab <re...@aon.at>.
I have also compared Tika's performance with the Nutch language detector
in version 1.0.
It seems that Nutch is far better in performance than Tika (5 to 6
times faster).
But my use case is quite special (short texts, ~140 characters long) and
I don't have time to investigate, so I have not reported it.
So maybe you can compare with the performance of the language detector in
Nutch 1.0.
I know that the Tika language detector is derived from Nutch,
but it has since been reimplemented and the code has changed:
1-grams, 2-grams and 4-grams have been omitted for a faster startup
time and smaller language profiles.
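
To make the n-gram point concrete, here is a toy Java sketch of the
3-gram-only profiling described above (an illustration of the idea,
not Tika's actual ProfilingWriter):

    import java.util.HashMap;
    import java.util.Map;

    public class TrigramProfiler {
        /** Count 3-grams over a text, with '_' standing in for word
         *  boundaries, the usual convention in these detectors. */
        public static Map<String, Integer> profile(String text) {
            String s = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i + 3 <= s.length(); i++) {
                counts.merge(s.substring(i, i + 3), 1, Integer::sum);
            }
            return counts;
        }
    }

Dropping the 1-, 2- and 4-grams as described shrinks each profile to a
single table like this one.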

regards
reinhard

On 25.10.2011 18:12, Michael McCandless wrote:
> OK I posted the 3rd post about CLD, this time testing perf by
> comparing to Tika and language-detection (Google Code project):
>
>     http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>
> Net/net all three do very well (>= 97% accuracy); I had to remove 4
> languages from consideration because we don't support them.
>
> Tika seems to have a lot of trouble with Spanish (confuses w/
> Galician) and Danish (confuses with Dutch).
>
> Also, Tika's performance is substantially slower than the other two... not
> sure what's up.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>   
>> On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
>> <kk...@transpac.com> wrote:
>>
>>     
>>> Sounds like a great idea - see the recent comment thread on https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
>>>
>>> And there's also https://issues.apache.org/jira/browse/TIKA-539
>>>       
>> Those do look related (if you swap charset in for language)!
>>
>> It's tricky to know just how much to "trust" what the server
>> (Content-Type HTTP header) and content (http-equiv meta tag) says,
>> though I do like CLD's approach: they never fully "trust" what was
>> declared but rather use the declaration as a hint to boost language
>> priors.
>>
>> And then to figure out what priors to assign for each hint they have
>> these tables trained from a large content set (10% of Base).
>>
>> If we have access to a biggish crawl we could presumably do something
>> similar, ie record how often the hint is wrong and translate that into
>> appropriate prior boosts, ie make it a hint instead of fully trusting
>> it.
>>
>> Does anyone know how ICU translates the encoding "hint" into priors
>> for each encoding?
>>
>>     
>>> Also, what will you be using to test language detection? Wikipedia pages?
>>>       
>> I'm using the corpus from here:
>>
>>    http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
>>
>> It's a random subset of europarl (1000 strings from each of 21 langs).
>>
>> Wikipedia would be great too!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>     
>   


Re: Google's Compact Language Detector

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 25, 2011 at 12:32 PM, Robert Muir <rc...@gmail.com> wrote:
> On Tue, Oct 25, 2011 at 12:12 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>
>> Tika seems to have a lot of trouble with Spanish (confuses w/
>> Galician) and Danish (confuses with Dutch).
>
> s/Dutch/Norwegian/

Woops, thanks!

Mike McCandless

http://blog.mikemccandless.com

Re: Google's Compact Language Detector

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Oct 25, 2011 at 12:12 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:

> Tika seems to have a lot of trouble with Spanish (confuses w/
> Galician) and Danish (confuses with Dutch).

s/Dutch/Norwegian/



-- 
lucidimagination.com

Re: Google's Compact Language Detector

Posted by Ken Krugler <kk...@transpac.com>.
On Oct 25, 2011, at 6:12pm, Michael McCandless wrote:

> OK I posted the 3rd post about CLD, this time testing perf by
> comparing to Tika and language-detection (Google Code project):
> 
>    http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
> 
> Net/net all three do very well (>= 97% accuracy); I had to remove 4
> languages from consideration because we don't support them.
> 
> Tika seems to have a lot of trouble with Spanish (confuses w/
> Galician) and Danish (confuses with Dutch).
> 
> Also, Tika's performance is substantially slower than the other two... not
> sure what's up.

I'm not surprised that Tika is slower than CLD, given the highly optimized nature of that code. Though 2 orders of magnitude is...painful.

I took a swing at this a while back, but didn't complete the patch.

The main issues I tried to solve were:

 - Tika processes all of the text in the document, which (for longer documents) slows it down significantly, versus sampling up to some limit.

 - The ProfilingWriter is very inefficient: every character processed does an array copy, and every three characters triggers a new String().
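
For what it's worth, a sketch of what fixing the second point could look
like: keep a tiny rolling buffer and hash it, instead of copying the
history array on every character and allocating a new String for every
trigram. This is illustrative only, not a patch against the real
ProfilingWriter:

    public class RollingTrigrams {
        private final char[] buf = new char[3];
        private int seen = 0;

        /** Feed one character; returns the trigram hash once three
         *  characters have been seen, else -1. No array copy and no
         *  String allocation on the hot path. */
        public int push(char c) {
            buf[0] = buf[1];
            buf[1] = buf[2];
            buf[2] = c;
            if (++seen < 3) return -1;
            // Same value new String(buf).hashCode() would produce, so
            // the trigram can key a counts table without materializing it.
            return (buf[0] * 31 + buf[1]) * 31 + buf[2];
        }
    }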

-- Ken

> http://blog.mikemccandless.com
> 
> On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
>> <kk...@transpac.com> wrote:
>> 
>>> Sounds like a great idea - see the recent comment thread on https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
>>> 
>>> And there's also https://issues.apache.org/jira/browse/TIKA-539
>> 
>> Those do look related (if you swap charset in for language)!
>> 
>> It's tricky to know just how much to "trust" what the server
>> (Content-Type HTTP header) and content (http-equiv meta tag) says,
>> though I do like CLD's approach: they never fully "trust" what was
>> declared but rather use the declaration as a hint to boost language
>> priors.
>> 
>> And then to figure out what priors to assign for each hint they have
>> these tables trained from a large content set (10% of Base).
>> 
>> If we have access to a biggish crawl we could presumably do something
>> similar, ie record how often the hint is wrong and translate that into
>> appropriate prior boosts, ie make it a hint instead of fully trusting
>> it.
>> 
>> Does anyone know how ICU translates the encoding "hint" into priors
>> for each encoding?
>> 
>>> Also, what will you be using to test language detection? Wikipedia pages?
>> 
>> I'm using the corpus from here:
>> 
>>    http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
>> 
>> It's a random subset of europarl (1000 strings from each of 21 langs).
>> 
>> Wikipedia would be great too!
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Re: Google's Compact Language Detector

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK I posted the 3rd post about CLD, this time testing perf by
comparing to Tika and language-detection (Google Code project):

    http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Net/net all three do very well (>= 97% accuracy); I had to remove 4
languages from consideration because we don't support them.

Tika seems to have a lot of trouble with Spanish (confuses w/
Galician) and Danish (confuses with Dutch).

Also, Tika's performance is substantially slower than the other two... not
sure what's up.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
> <kk...@transpac.com> wrote:
>
>> Sounds like a great idea - see the recent comment thread on https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
>>
>> And there's also https://issues.apache.org/jira/browse/TIKA-539
>
> Those do look related (if you swap charset in for language)!
>
> It's tricky to know just how much to "trust" what the server
> (Content-Type HTTP header) and content (http-equiv meta tag) says,
> though I do like CLD's approach: they never fully "trust" what was
> declared but rather use the declaration as a hint to boost language
> priors.
>
> And then to figure out what priors to assign for each hint they have
> these tables trained from a large content set (10% of Base).
>
> If we have access to a biggish crawl we could presumably do something
> similar, ie record how often the hint is wrong and translate that into
> appropriate prior boosts, ie make it a hint instead of fully trusting
> it.
>
> Does anyone know how ICU translates the encoding "hint" into priors
> for each encoding?
>
>> Also, what will you be using to test language detection? Wikipedia pages?
>
> I'm using the corpus from here:
>
>    http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
>
> It's a random subset of europarl (1000 strings from each of 21 langs).
>
> Wikipedia would be great too!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>

Re: Google's Compact Language Detector

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
<kk...@transpac.com> wrote:

> Sounds like a great idea - see the recent comment thread on https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
>
> And there's also https://issues.apache.org/jira/browse/TIKA-539

Those do look related (if you swap charset in for language)!

It's tricky to know just how much to "trust" what the server
(Content-Type HTTP header) and content (http-equiv meta tag) says,
though I do like CLD's approach: they never fully "trust" what was
declared but rather use the declaration as a hint to boost language
priors.

And then to figure out what priors to assign for each hint they have
these tables trained from a large content set (10% of Base).

If we have access to a biggish crawl we could presumably do something
similar, ie record how often the hint is wrong and translate that into
appropriate prior boosts, ie make it a hint instead of fully trusting
it.
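
A sketch of the bookkeeping that would involve, assuming we had a crawl
where each page's declared language can be checked against a trusted
detector (class and method names here are invented for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class HintPriors {
        // declared language -> {times seen, times it matched}
        private final Map<String, int[]> stats = new HashMap<>();

        /** Tally one page: what the server/meta declared vs. what a
         *  trusted detector actually found. */
        public void record(String declared, String detected) {
            int[] s = stats.computeIfAbsent(declared, k -> new int[2]);
            s[0]++;
            if (declared.equals(detected)) s[1]++;
        }

        /** Log-odds boost for the declared language's prior: a hint
         *  that is usually right earns a big boost, one that is often
         *  wrong earns little or none. Laplace-smoothed. */
        public double boost(String declared) {
            int[] s = stats.getOrDefault(declared, new int[2]);
            double p = (s[1] + 1.0) / (s[0] + 2.0);
            return Math.log(p / (1.0 - p));
        }
    }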

Does anyone know how ICU translates the encoding "hint" into priors
for each encoding?

> Also, what will you be using to test language detection? Wikipedia pages?

I'm using the corpus from here:

    http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/

It's a random subset of europarl (1000 strings from each of 21 langs).

Wikipedia would be great too!
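
For anyone who wants to reproduce this kind of test, a minimal harness
against Tika's 1.x-era API might look like the sketch below; the corpus
loading is left out, and each entry is assumed to be an
(expected language, text) pair:

    import java.util.List;
    import org.apache.tika.language.LanguageIdentifier;

    public class AccuracyTest {
        /** Fraction of labeled strings whose detected language matches
         *  the expected ISO code. */
        public static double accuracy(List<String[]> labeled) {
            int correct = 0;
            for (String[] pair : labeled) {      // pair = {lang, text}
                LanguageIdentifier id = new LanguageIdentifier(pair[1]);
                if (id.getLanguage().equals(pair[0])) correct++;
            }
            return correct / (double) labeled.size();
        }
    }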

Mike McCandless

http://blog.mikemccandless.com

Re: Google's Compact Language Detector

Posted by Ken Krugler <kk...@transpac.com>.
Hi Mike,

Sounds like a great idea - see the recent comment thread on https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.

And there's also https://issues.apache.org/jira/browse/TIKA-539

Also, what will you be using to test language detection? Wikipedia pages?

-- Ken

On Oct 24, 2011, at 7:29pm, Michael McCandless wrote:

> I've only scratched the surface in figuring out how CLD
> works... excising the code and exposing a Python wrapper is much
> easier than actually understanding it!
> 
> It has some neat features, like passing in three possible "hints":
> 
>  * domain extension (fr boosts French)
> 
>  * declared encoding
> 
>  * declared language
> 
> It uses these hints to set pre-computed priors for the top 3 languages.
> 
> It can optionally "abstain" from guessing if it thinks it's not very
> confident for certain matches.  It has an overall "reliable" bool that
> comes back, which is true if the match is high confidence (like Tika's
> isReasonablyCertain, though that's per-match).
> 
> But, you can't [easily] limit up front the set of languages to test
> like you can with Tika (I think?  You can just .addProfile() for each
> language you want?  Hmm though loading a LanguageProfile from a .ngp
> file looks like it's private inside LanguageIdentifier).
> 
> I'm trying to test Tika vs CLD vs the Java language-detection library
> (http://code.google.com/p/language-detection)... hoping to finish that
> soon and do a follow-on blog post.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Mon, Oct 24, 2011 at 9:45 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>> I took a quick look just now, though it's not really documented yet - it's in the process of being separated out of Chrome.
>> 
>> But it looks like they store pre-calculated compression models for languages, and then figure out which model works best on the text being analyzed (which implies the text has bytes with a similar probabilistic distribution/sequencing to that model's training data).
>> 
>> -- Ken
>> 
>> On Oct 24, 2011, at 3:18pm, Jérôme Charron wrote:
>> 
>>> Hi,
>>> 
>>> I just found this blog post from Mike McCandless about Google's Compact
>>> Language Detection code used in Chrome:
>>> http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
>>> 
>>> There are probably some interesting things to explore in the Google code in
>>> order to improve Tika's language detection.
>>> Has anyone already taken a look at the Google CLD code?
>>> http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
>>> 
>>> Best regards
>>> 
>>> Jérôme
>>> 
>>> --
>>> @jcharron
>>> http://motre.ch/
>>> http://jcharron.posterous.com/
>>> http://www.shopreflex.fr/
>>> http://www.staragora.com/
>>> 
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
>> 
>> 
>> 
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Re: Google's Compact Language Detector

Posted by Michael McCandless <lu...@mikemccandless.com>.
I've only scratched the surface in figuring out how CLD
works... excising the code and exposing a Python wrapper is much
easier than actually understanding it!

It has some neat features, like passing in three possible "hints":

  * domain extension (fr boosts French)

  * declared encoding

  * declared language

It uses these hints to set pre-computed priors for the top 3 languages.

It can optionally "abstain" from guessing if it thinks it's not very
confident for certain matches.  It has an overall "reliable" bool that
comes back, which is true if the match is high confidence (like Tika's
isReasonablyCertain, though that's per-match).

But, you can't [easily] limit up front the set of languages to test
like you can with Tika (I think?  You can just .addProfile() for each
language you want?  Hmm though loading a LanguageProfile from a .ngp
file looks like it's private inside LanguageIdentifier).
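
For reference, the basic Tika call being compared against looks roughly
like this (Tika 1.x-era API; whether addProfile alone is enough to limit
the candidate set is exactly the open question above):

    import org.apache.tika.language.LanguageIdentifier;

    public class TikaDetect {
        public static void main(String[] args) {
            LanguageIdentifier id = new LanguageIdentifier(
                    "Ceci est un petit texte en français.");
            // Tika's per-match confidence flag, the analogue of CLD's
            // overall "reliable" bool mentioned above.
            System.out.println(id.getLanguage()
                    + ", reasonably certain: " + id.isReasonablyCertain());
        }
    }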

I'm trying to test Tika vs CLD vs the Java language-detection library
(http://code.google.com/p/language-detection)... hoping to finish that
soon and do a follow-on blog post.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Oct 24, 2011 at 9:45 AM, Ken Krugler
<kk...@transpac.com> wrote:
> I took a quick look just now, though it's not really documented yet - it's in the process of being separated out of Chrome.
>
> But it looks like they store pre-calculated compression models for languages, and then figure out which model works best on the text being analyzed (which implies the text has bytes with a similar probabilistic distribution/sequencing to that model's training data).
>
> -- Ken
>
> On Oct 24, 2011, at 3:18pm, Jérôme Charron wrote:
>
>> Hi,
>>
>> I just found this blog post from Mike McCandless about Google's Compact
>> Language Detection code used in Chrome:
>> http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
>>
>> There are probably some interesting things to explore in the Google code in
>> order to improve Tika's language detection.
>> Has anyone already taken a look at the Google CLD code?
>> http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
>>
>> Best regards
>>
>> Jérôme
>>
>> --
>> @jcharron
>> http://motre.ch/
>> http://jcharron.posterous.com/
>> http://www.shopreflex.fr/
>> http://www.staragora.com/
>>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>

Re: Google's Compact Language Detector

Posted by Ken Krugler <kk...@transpac.com>.
I took a quick look just now, though it's not really documented yet - it's in the process of being separated out of Chrome.

But it looks like they store pre-calculated compression models for languages, and then figure out which model works best on the text being analyzed (which implies the text has bytes with a similar probabilistic distribution/sequencing to that model's training data).
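
In spirit that boils down to something like the sketch below: score the
text under each language's pre-built n-gram probability table and keep
the best fit. This is a toy restatement of the idea, not CLD's actual
data structures:

    import java.util.Map;

    public class BestModel {
        /** Return the language whose model gives the text the highest
         *  log-probability, i.e. would compress it best. */
        public static String detect(String text,
                Map<String, Map<String, Double>> models) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Map<String, Double>> m : models.entrySet()) {
                double score = 0.0;
                for (int i = 0; i + 3 <= text.length(); i++) {
                    // Unseen trigrams get a small floor probability.
                    score += Math.log(m.getValue()
                            .getOrDefault(text.substring(i, i + 3), 1e-9));
                }
                if (score > bestScore) { bestScore = score; best = m.getKey(); }
            }
            return best;
        }
    }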

-- Ken

On Oct 24, 2011, at 3:18pm, Jérôme Charron wrote:

> Hi,
> 
> I just found this blog post from Mike McCandless about Google's Compact
> Language Detection code used in Chrome:
> http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
> 
> There are probably some interesting things to explore in the Google code in
> order to improve Tika's language detection.
> Has anyone already taken a look at the Google CLD code?
> http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
> 
> Best regards
> 
> Jérôme
> 
> -- 
> @jcharron
> http://motre.ch/
> http://jcharron.posterous.com/
> http://www.shopreflex.fr/
> http://www.staragora.com/
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr