Posted to dev@solr.apache.org by Jan Høydahl <ja...@cominvent.com> on 2023/03/02 19:21:50 UTC

[DISCUSS] Language detection in Solr

Hi,

Solr supports pluggable language detectors <https://solr.apache.org/guide/solr/latest/indexing-guide/language-detection.html>:

> Solr supports three implementations of this feature:
> 
> Tika’s language detection feature: https://tika.apache.org/1.28.4/detection.html
> LangDetect language detection: https://github.com/shuyo/language-detection
> OpenNLP language detection: http://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.langdetect

Since our first implementation, the Tika project <https://tika.apache.org/2.7.0/detection.html#Language_Detection> has evolved its language detection capabilities and added a pluggable architecture as well:
https://github.com/apache/tika/tree/main/tika-langdetect

One of Solr's langid plugins is "langdetect", which has not been updated in 10 years. I'd like to deprecate it and remove it in main for that reason.

Longer term question: Does it make sense for us to keep maintaining our own set of language detectors in this landscape?
We could re-purpose the langid module so that it uses Tika's pluggable detectors in some way, perhaps with thin wrapper classes in Solr?

Wdyt?

Jan

Re: [DISCUSS] Language detection in Solr

Posted by Eric Pugh <ep...@opensourceconnections.com>.
+1 agreed for Tika and deprecating langdetect. 

I was surprised to see how old langdetect was!


> On Mar 6, 2023, at 4:43 PM, Alessandro Benedetti <a....@sease.io> wrote:
> 
> +1 for delegating to Tika, which is a much better place for that (and which
> they are actively evolving).
> 
> +1 for deprecating the old and not updated plugins as well (langdetect)

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	


Re: [DISCUSS] Language detection in Solr

Posted by Alessandro Benedetti <a....@sease.io>.
+1 for delegating to Tika, which is a much better place for that (and which
they are actively evolving).

+1 for deprecating the old and not updated plugins as well (langdetect)

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

