Posted to dev@tika.apache.org by Jan Høydahl <ja...@cominvent.com> on 2012/04/09 01:16:47 UTC

Re: Pluggable language detection

In Solr, we added support for pluggable language detectors, one of which is Tika's. See http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/langid/
The detectLanguage() method returns a list of DetectedLanguage objects, each with a normalized certainty between 0.0 and 1.0. I think it's a step in the right direction.
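
To make the shape of such an API concrete, here is a minimal Java sketch of a detector contract that returns a ranked list of candidates with a normalized certainty. Only detectLanguage() and DetectedLanguage come from the description above; the LanguageDetector interface, the StopWordDetector class, and the getter names are assumptions made for illustration and are not copied from the Solr or Tika sources. The stop-word scoring is a toy, not a real detection algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** One candidate language with a certainty normalized to [0.0, 1.0]. */
final class DetectedLanguage {
    private final String langCode;
    private final double certainty;

    DetectedLanguage(String langCode, double certainty) {
        if (certainty < 0.0 || certainty > 1.0) {
            throw new IllegalArgumentException("certainty must be in [0.0, 1.0]");
        }
        this.langCode = langCode;
        this.certainty = certainty;
    }

    String getLangCode() { return langCode; }
    double getCertainty() { return certainty; }
}

/** Hypothetical common interface that each pluggable detector would implement. */
interface LanguageDetector {
    /** Returns candidates sorted by descending certainty; empty if nothing matched. */
    List<DetectedLanguage> detectLanguage(String text);
}

/** Toy stop-word detector, for illustration only. */
final class StopWordDetector implements LanguageDetector {
    private static final Set<String> EN = new HashSet<>(Arrays.asList("the", "and", "is", "of"));
    private static final Set<String> FR = new HashSet<>(Arrays.asList("le", "la", "et", "est"));

    @Override
    public List<DetectedLanguage> detectLanguage(String text) {
        int en = 0;
        int fr = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (EN.contains(token)) en++;
            if (FR.contains(token)) fr++;
        }
        List<DetectedLanguage> result = new ArrayList<>();
        int total = en + fr;
        if (total == 0) {
            return result;
        }
        // Certainty here is simply each language's share of the matched stop words.
        if (en > 0) result.add(new DetectedLanguage("en", (double) en / total));
        if (fr > 0) result.add(new DetectedLanguage("fr", (double) fr / total));
        result.sort((a, b) -> Double.compare(b.getCertainty(), a.getCertainty()));
        return result;
    }
}
```

With this sketch, new StopWordDetector().detectLanguage("le chat et the dog") ranks "fr" ahead of "en", because two of the three matched stop words are French. A caller can then apply its own certainty threshold instead of trusting a single yes/no answer.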

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 22. mars 2012, at 11:22, Julien Nioche wrote:

> If you mean integrating a better third-party detector - that's exactly my
> point. We don't develop and maintain our own parsers, why should we follow
> a different logic when it comes to language identification? There are other
> resources around, so why don't we just use them? I assume that by default our
> existing detector (improved or not) could still be used, all we need is
> just a mechanism to be able to select an alternative implementation and a
> common interface. That's probably not a big deal to implement. Any thoughts
> on how to do it? Are there any things we should reuse from the way we deal
> with the parsers?
> 
> Thanks for your comments
> 
> Julien
> 
> 
> On 21 March 2012 16:55, Ken Krugler <kk...@transpac.com> wrote:
> 
>> 
>> On Mar 21, 2012, at 8:51am, Julien Nioche wrote:
>> 
>>> Hi guys,
>>> 
>>> Just wondering about the best way to make the language detection pluggable
>>> instead of having it hard-wired as it is now. We know that the resources
>>> that are currently in Tika are both slow and inaccurate [1] and there are
>>> other libraries that we could leverage. Why not have the option to select
>>> a different implementation just like we do for parsers? Obviously we'd need
>>> a common interface for the parsers etc...
>>> 
>>> What do you think?
>> 
>> I'd be more in favor of using that time to integrate a better language
>> detector into Tika, so that everybody wins from the work :)
>> 
>> -- Ken
>> 
>> 
>>> [1] http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>>> 
>>> --
>>> *
>>> *Open Source Solutions for Text Engineering
>>> 
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>> 
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
>> 
> 
> 
> -- 
> *
> *Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
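
The selection mechanism asked about in the quoted message above (a common interface, an alternative implementation chosen on request, and the existing detector as the default) could be sketched with the standard java.util.ServiceLoader. This is only one possible shape: LangDetector, BuiltInDetector, and DetectorRegistry are hypothetical names, and the sketch assumes detectors would be registered through META-INF/services provider files on the classpath.

```java
import java.util.ServiceLoader;

/** Hypothetical minimal detector contract; the name and methods are illustrative. */
interface LangDetector {
    /** Identifier used to select this implementation. */
    String name();

    /** Returns a language code such as "en". */
    String detect(String text);
}

/** Stand-in for the existing built-in detector, used when nothing else matches. */
final class BuiltInDetector implements LangDetector {
    @Override
    public String name() {
        return "builtin";
    }

    @Override
    public String detect(String text) {
        // Placeholder result; a real implementation would analyze the text.
        return "unknown";
    }
}

final class DetectorRegistry {
    private DetectorRegistry() {}

    /**
     * Returns the registered provider whose name() matches requestedName,
     * or the built-in default when no such provider is on the classpath.
     */
    static LangDetector select(String requestedName) {
        for (LangDetector candidate : ServiceLoader.load(LangDetector.class)) {
            if (candidate.name().equals(requestedName)) {
                return candidate;
            }
        }
        return new BuiltInDetector();
    }
}
```

With no provider files on the classpath, DetectorRegistry.select("some-third-party") falls back to the built-in detector, so existing behavior is preserved unless an alternative is both installed and requested.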


Re: Pluggable language detection

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Jan,

It probably makes sense to provide pluggable language detection in Tika, since it's the lower-level library,
so I am +1 for figuring out a way to implement it in Tika-ville.

If no one has started on this in the next few weeks I'll give it a go.

Cheers,
Chris

On Apr 8, 2012, at 4:16 PM, Jan Høydahl wrote:

> In Solr, we added support for pluggable language detectors, one of which is Tika's. See http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/langid/
> The detectLanguage() method returns a list of DetectedLanguage objects, each with a normalized certainty between 0.0 and 1.0. I think it's a step in the right direction.
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
> 
[...snip...]

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++