Posted to dev@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2020/09/04 12:57:05 UTC

Re: [jira] [Commented] (SOLR-13973) Deprecate Tika

Let’s discuss how we can accommodate Drupal and Solarium (and others) with a minimal amount of pain to those projects rather than insist on keeping things the way they are. We’ve been recommending against using ExtractingRequestHandler in production for years and apparently that advice has been ignored. We all struggle with tech debt, and this is another example. Solr shouldn’t be constrained by another project’s tech debt.

For instance, have you looked at Tika Server? See: https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-InstallationofTikaServer. IDK whether that’s a viable solution or not but it seems like an option worth exploring, and has been raised in SOLR-7632.
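
To make the idea concrete, here is a rough Python sketch of what client-side extraction could look like: send the raw file to a standalone Tika Server, then index the returned plain text into Solr. It is only a sketch under stated assumptions, not a recommended implementation. The host names, the collection name "mycollection", the field name "content_txt", and the function names are all hypothetical; it assumes Tika Server listens on its default port 9998 and that the Solr schema has a dynamic field matching "*_txt".

```python
import json
import urllib.request

# Assumed endpoints: adjust to your own deployment.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for the plain-text content of a file."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_solr_request(doc_id: str, text: str,
                       solr_url: str = SOLR_UPDATE_URL) -> urllib.request.Request:
    """Build a JSON update request that adds one document to Solr."""
    # "content_txt" is a hypothetical field name; use whatever your schema defines.
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(path: str, doc_id: str) -> int:
    """Extract text with Tika Server, index it into Solr, return Solr's HTTP status."""
    with open(path, "rb") as f:
        data = f.read()
    with urllib.request.urlopen(build_tika_request(data)) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(doc_id, text)) as resp:
        return resp.status
```

With both services running, something like extract_and_index("report.pdf", "report-1") would do the round trip. The same two HTTP calls could be made from PHP or any other client language, which is the point of moving extraction out of Solr.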

Best,
Erick


> On Sep 4, 2020, at 7:16 AM, Markus Kalkbrenner (Jira) <ji...@apache.org> wrote:
> 
> 
>    [ https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190681#comment-17190681 ] 
> 
> Markus Kalkbrenner commented on SOLR-13973:
> -------------------------------------------
> 
> {quote}Perhaps even a simple Tika integration in SolrJ would make sense, making it super simple to do the extraction on client side, which is probably what most users should consider anyway.
> {quote}
> As maintainer of Solarium, the major PHP client for Solr, and of the Solr-Drupal integration, I know that there are users and Solr service providers who rely on the ExtractingRequestHandler and the out-of-the-box experience [~AndrewGr] described. Even though I understand your motivation as a developer, moving the workflow to the client side will put a significant workload on other developers, even if you add Tika support to SolrJ.
> The number of people who use Solr with a different programming language may well be higher than the number of Java projects that use SolrJ.
> 
> There are more than 40,000 active Drupal installations using Solr as their search backend today:
> [https://www.drupal.org/project/usage/search_api_solr]
> 
> GitHub lists 895 repositories that directly depend on the PHP Solarium library:
> [https://github.com/solariumphp/solarium/network/dependents]
> 
> These include packages from other PHP frameworks such as Symfony, Laravel, TYPO3, WordPress, ...
> 
> Nearly 200,000 Composer-based build processes of PHP projects pulled the Solarium library within the last 30 days:
> [https://packagist.org/packages/solarium/solarium/stats#major/all]
> 
> Admittedly, only a few of these installations will use Tika indirectly via the extraction handler. But it won't be an easy task for them to add a standalone Tika server to their stack. I know a lot of hosting providers that don't offer it to their customers yet.
> 
> I'm not saying that you shouldn't deprecate the embedded Tika at all. But take careful steps, and be aware that the community of Solr users might be much larger than you think because of the out-of-the-box solutions that exist, especially in the PHP world.
> 
> BTW, SOLR-14768 was detected by the automated integration tests of the Solarium library and also by those of the Search API Solr Drupal module!
> 
>  
> 
>> Deprecate Tika
>> --------------
>> 
>>                Key: SOLR-13973
>>                URL: https://issues.apache.org/jira/browse/SOLR-13973
>>            Project: Solr
>>         Issue Type: Improvement
>>           Reporter: Ishan Chattopadhyaya
>>           Assignee: Ishan Chattopadhyaya
>>           Priority: Blocker
>>            Fix For: 8.7
>> 
>>         Time Spent: 10m
>> Remaining Estimate: 0h
>> 
>> Solr's primary responsibility should be to focus on search and scalability. Having to deal with the problems (CVEs) of Velocity, Tika, etc. can slow us down. I propose that we deprecate it going forward.
>> Tika can be run outside Solr. Going forward, if someone wants to use these components, it should be possible to move them into third-party packages and install them via the package manager.
>> The plan for now is just to log warnings and add deprecation notes to the reference guide. Removal can be done in 9.0.
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
> For additional commands, e-mail: issues-help@lucene.apache.org
> 

