Posted to dev@tika.apache.org by Jan Høydahl <ja...@cominvent.com> on 2012/04/09 01:16:47 UTC
Re: Pluggable language detection
In Solr, we added support for pluggable language detectors, one of which is Tika's. See http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/langid/
The detectLanguage() method returns a list of DetectedLanguage objects, each with a normalized certainty between 0.0 and 1.0. I think it's a step in the right direction.
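
To make the shape of such an interface concrete, here is a minimal sketch in Java. It is not the actual Solr or Tika code: the LanguageDetector interface, the StopwordDetector class, its tiny word lists, and the accessor names getLangCode() and getCertainty() are all assumptions made for illustration. The only parts taken from the description above are a detectLanguage() method that returns a list of DetectedLanguage objects and a certainty normalized to the range 0.0 to 1.0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical value type: a language code plus a certainty in [0.0, 1.0].
final class DetectedLanguage {
    private final String langCode;
    private final double certainty;

    DetectedLanguage(String langCode, double certainty) {
        if (certainty < 0.0 || certainty > 1.0) {
            throw new IllegalArgumentException("certainty must be in [0.0, 1.0]");
        }
        this.langCode = langCode;
        this.certainty = certainty;
    }

    String getLangCode() { return langCode; }
    double getCertainty() { return certainty; }
}

// Hypothetical common interface that every pluggable detector would implement.
interface LanguageDetector {
    // Returns candidates ordered from most to least certain; empty if unknown.
    List<DetectedLanguage> detectLanguage(String text);
}

// Toy implementation (assumption): scores languages by counting stopword hits.
final class StopwordDetector implements LanguageDetector {
    private final Map<String, Set<String>> stopwords = new LinkedHashMap<>();

    StopwordDetector() {
        stopwords.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is")));
        stopwords.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "les")));
    }

    @Override
    public List<DetectedLanguage> detectLanguage(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        Map<String, Integer> hits = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String, Set<String>> e : stopwords.entrySet()) {
            int count = 0;
            for (String token : tokens) {
                if (e.getValue().contains(token)) {
                    count++;
                }
            }
            hits.put(e.getKey(), count);
            total += count;
        }

        List<DetectedLanguage> result = new ArrayList<>();
        if (total == 0) {
            return result; // no evidence for any language
        }
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() > 0) {
                // Normalize so the certainties of all candidates sum to 1.0.
                result.add(new DetectedLanguage(e.getKey(), (double) e.getValue() / total));
            }
        }
        // Highest certainty first.
        result.sort((a, b) -> Double.compare(b.getCertainty(), a.getCertainty()));
        return result;
    }
}
```

A caller would depend only on LanguageDetector, so a different implementation (for example a wrapper around Tika's existing detector or a third-party library) could be swapped in without touching the calling code.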
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
On 22 March 2012, at 11:22, Julien Nioche wrote:
> If you mean integrating a better third-party detector - that's exactly my
> point. We don't develop and maintain our own parsers, why should we follow
> a different logic when it comes to language identification? There are other
> resources around, so why don't we just use them? I assume that by default our
> existing detector (improved or not) could still be used, all we need is
> just a mechanism to be able to select an alternative implementation and a
> common interface. That's probably not a big deal to implement. Any thoughts
> on how to do it? Are there any things we should reuse from the way we deal
> with the parsers?
>
> Thanks for your comments
>
> Julien
>
>
> On 21 March 2012 16:55, Ken Krugler <kk...@transpac.com> wrote:
>
>>
>> On Mar 21, 2012, at 8:51am, Julien Nioche wrote:
>>
>>> Hi guys,
>>>
>>> Just wondering about the best way to make the language detection pluggable
>>> instead of having it hard-wired as it is now. We know that the resources
>>> that are currently in Tika are both slow and inaccurate [1] and there are
>>> other libraries that we could leverage. Why not have the option to select
>>> a different implementation just like we do for parsers? Obviously we'd need
>>> a common interface for the parsers etc...
>>>
>>> What do you think?
>>
>> I'd be more in favor of using that time to integrate a better language
>> detector into Tika, so that everybody wins from the work :)
>>
>> -- Ken
>>
>>
>>> [1] http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>>>
>>> --
>>> *
>>> *Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
>>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
Re: Pluggable language detection
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Jan,
It probably makes sense to provide pluggable language detection in Tika, since it's the lower-level library,
so I am +1 on figuring out a way to implement it in Tika-ville.
If no one has started on this in the next few weeks I'll give it a go.
Cheers,
Chris
On Apr 8, 2012, at 4:16 PM, Jan Høydahl wrote:
> In Solr, we added support for pluggable language detectors, one of which is Tika's. See http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/langid/
> The detectLanguage() method returns a list of DetectedLanguage objects, each with a normalized certainty between 0.0 and 1.0. I think it's a step in the right direction.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
[...snip...]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++