Posted to dev@tika.apache.org by Eric Pugh <ep...@opensourceconnections.com> on 2019/12/04 17:24:22 UTC
Do we have a community supported approach for deploying Tika Server
in production?
Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
Over in Solr land there has been renewed discussion about streamlining what Solr is....
In regards to rich content extraction and the Tika project, it seems like the two ideas that continue to preserve the existing behavior are:
1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr. This slims down the standard Solr download, and *might* make it easier to update the version of Tika + dependent jars used?
2) The second approach is to instead require Tika-Server to be running (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate the call to Tika-Server.
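For anyone unfamiliar with what "delegating the call" looks like on the wire, here is a minimal Python sketch of option 2 from a client's point of view. It assumes a Tika Server instance reachable at http://localhost:9998 (the default port) and uses the server's PUT /tika endpoint to request plain text. The function names are hypothetical, and this is not how SOLR-7633 does or would implement the delegation.

```python
import urllib.request

# Assumption: Tika Server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(document: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika Server for plain text."""
    return urllib.request.Request(
        url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document: bytes, url: str = TIKA_URL, timeout: float = 30.0) -> str:
    """Send the document to Tika Server and return the extracted text."""
    request = build_extract_request(document, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, something like extract_text(open("report.pdf", "rb").read()) would return the extracted text (report.pdf being a placeholder file name).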
I was thinking about why I like option 1 better than 2, and I think it boils down to how mature the IT organization I am working with is. Some IT organizations have large dev-ops teams and work at major scale, and for them managing a fleet of Tika-Server instances on Kubernetes, with a load balancer dynamically scaling up and down, is simple and second nature! However, many organizations aren’t like that.
So I guess what I’m asking is: do we have a reasonable, supported approach for deploying Tika Server for organizations that aren’t Tika-savvy? I’m thinking about Solr, and specifically the fact that Solr has a well-defined set of Service Installation scripts. When I follow the directions in https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production I can feel confident that when the server is rebooted, Solr will come back up! Plus there is log rotation and all the rest.
In contrast, when I look at the Tika website, specifically the https://tika.apache.org/1.22/gettingstarted.htm page, the message is to run Tika as a command line application, or embedded in your application.
I’m wondering if Tika-Server needs to be made more prominent, and treated as the “primary method of interacting with Tika”? Do we need as a community to focus more on Tika-Server? In our getting started documentation, in our usage documentation, and in our examples?
Do we need to create the equivalent of the Service Installation scripts for Tika-Server?
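As a rough illustration of what a service installation could amount to on a systemd-based Linux host, the Python sketch below renders a unit file that restarts Tika Server when it fails. The jar path, the tika user, the /usr/bin/java location and the function name are all assumptions made for illustration. This is not the script proposed for Tika, and it does not cover log rotation.

```python
from string import Template

# Hypothetical unit file layout; adjust paths and user for a real host.
UNIT_TEMPLATE = Template("""\
[Unit]
Description=Apache Tika Server
After=network.target

[Service]
User=$user
ExecStart=/usr/bin/java -jar $jar_path --port $port
Restart=on-failure

[Install]
WantedBy=multi-user.target
""")


def render_unit(jar_path: str, user: str = "tika", port: int = 9998) -> str:
    """Return the text of a systemd unit that runs the Tika Server jar."""
    return UNIT_TEMPLATE.substitute(jar_path=jar_path, user=user, port=port)


if __name__ == "__main__":
    # Print the unit so it can be reviewed before being installed.
    print(render_unit("/opt/tika/tika-server.jar"))
```

Saved as /etc/systemd/system/tika-server.service and enabled with systemctl enable, a unit along these lines would bring the server back after a reboot.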
Wanted to stoke the discussion!
Eric
_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Eric Pugh <ep...@opensourceconnections.com>.
Dave, I pushed up TIKA-3039 with this change for your review and commit!
> On Feb 6, 2020, at 7:20 AM, Eric Pugh <ep...@opensourceconnections.com> wrote:
>
> Great!
>
>> On Feb 5, 2020, at 10:55 PM, David Meikle <david@meikle.io> wrote:
>>
>> Hi Eric,
>>
>> +1 - I think we should drop that and rely on tika-docker instead.
>>
>> I'm about to push more to it tonight, and then we could include it as a
>> sub-module in Tika to do regular development snapshots too.
>>
>> Cheers,
>> Dave
>>
>> On Wed, 5 Feb 2020 at 15:34, Eric Pugh <epugh@opensourceconnections.com> wrote:
>>
>>> Following this thread, should we deprecate/remove the Tika Docker support
>>> that is in Tika-server project?
>>>
>>> The `mvn dockerfile:build` command now relies on a plugin that is no
>>> longer supported according to https://github.com/spotify/dockerfile-maven,
>>> and it seems like the Tika-docker project is really the right place for
>>> this!
>>>
>>> I’m thinking that this might help reduce the footprint of things we need
>>> to support.
>>>
>>>> On Jan 9, 2020, at 12:08 AM, Chris Mattmann <mattmann@apache.org> wrote:
>>>>
>>>> +1
>>>>
>>>> Note there is also a USC tika dockers repo where I put the data science
>>>> stuff too:
>>>>
>>>> http://github.com/USCDataScience/tika-dockers
>>>>
>>>> I’ll continue to push DL and ML Tika stuff there.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> From: Dave Meikle <dm...@apache.org>
>>>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>> Date: Wednesday, January 8, 2020 at 2:18 PM
>>>> To: "<de...@tika.apache.org>" <de...@tika.apache.org>
>>>> Subject: Re: [EXTERNAL] Do we have a community supported approach for
>>>> deploying Tika Server in production?
>>>>
>>>> Hi Eric,
>>>>
>>>> Will take a look. On a related note, I've created a new repo:
>>>> https://github.com/apache/tika-docker
>>>>
>>>> Thinking based on looking at the PRs and Issues on LogicalSpark
>>>> docker-tikaserver, I'll create an updated docker file using what you've
>>>> added here and look to publish builds to docker hub from that.
>>>>
>>>> What do you think?
>>>>
>>>> Cheers,
>>>> Dave
>>>>
>>>> On Wed, 8 Jan 2020 at 03:16, Eric Pugh <ep...@opensourceconnections.com> wrote:
>>>>
>>>> Hi all, I’ve gone ahead and added the -spawnChild property as a default
>>>> when running Tika Server as a service. I’d love some eyes on the PR, and
>>>> if this looks good, get it committed.
>>>>
>>>> Feedback welcome!
>>>>
>>>> Eric
>>>>
>>>>> On Dec 17, 2019, at 12:53 PM, Eric Pugh <epugh@opensourceconnections.com> wrote:
>>>>>
>>>>> Cool.
>>>>>
>>>>> It’s the auto run that I really need, and the other part that I don’t
>>>>> think I’ve tackled properly is the managing of logs…
>>>>>
>>>>> I’m going to check with my project to see if they support Snap packages.
>>>>>
>>>>> Eric
>>>>>
>>>>>> On Dec 16, 2019, at 5:10 PM, Tom Barber <tom@spicule.co.uk> wrote:
>>>>>>
>>>>>> Just saw this fly by and FYI on Linux systems that support Snap
>>>>>> packages (Ubuntu/Debian/Arch/Fedora etc.) you can `snap install tika-server`.
>>>>>> It doesn’t yet auto-run, I don’t believe, but you can just run `tika-server.run`,
>>>>>> and adding an init script wouldn’t take 5 minutes.
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> On 16 December 2019 at 18:42:55, Eric Pugh (epugh@opensourceconnections.com) wrote:
>>>>>>
>>>>>>> Hi folks!
>>>>>>>
>>>>>>> I’ve got a mostly completed PR for having install scripts for Tika
>>>>>>> Server, and I’m hoping a committer will take a look at the PR, and give
>>>>>>> feedback (and ideally commit in time for 1.24!)
>>>>>>>
>>>>>>> A couple of things:
>>>>>>>
>>>>>>> 1) This was completely influenced by
>>>>>>> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script,
>>>>>>> in fact I started with the Solr scripts.
>>>>>>>
>>>>>>> 2) I’ve deleted all the Solr specific aspects (I think), however there
>>>>>>> may still be more to delete.
>>>>>>>
>>>>>>> 3) This requires a change to how we release Tika, previously we ship
>>>>>>> tika-app.jar and Tika-eval.jar, and Tika-server.jar, and now, I think, we
>>>>>>> want to add the tika-server-bin.tgz and tika-server-bin.zip binary
>>>>>>> distributions.
>>>>>>>
>>>>>>> I’m happy to start writing accompanying “how to deploy Tika Server”
>>>>>>> docs if this PR looks good! Or, please give input and I’ll make the updates.
>>>>>>>
>>>>>>> Eric
>>>>>>>
>>>>>>>> On Dec 12, 2019, at 2:39 PM, Eric Pugh <epugh@opensourceconnections.com> wrote:
>>>>>>>>
>>>>>>>> I’ve created this JIRA to track this work:
>>>>>>>> https://issues.apache.org/jira/browse/TIKA-3010
>>>>>>>>
>>>>>>>> And a WIP PR is at https://github.com/apache/tika/pull/305
>>>>>>>>
>>>>>>>> My thought is to put something together that mimics how we deploy
>>>>>>>> Solr, and see how that works. I have a need for an install process that a
>>>>>>>> general IT person can follow, who isn’t a Tika expert or a Docker user.
>>>>>>>>
>>>>>>>>> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattmann@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Thanks for bringing this conversation up Eric.
>>>>>>>>>
>>>>>>>>> Historically if you look over the last 5 years, I think what you
>>>>>>>>> are asking below has sort of already become the de facto
>>>>>>>>> truth. Most people are in fact using Tika server, whether they are
>>>>>>>>> individual devs, govvies, commercial folk and the like.
>>>>>>>>> Big, small and medium projects. Evidenced by the expansion of Tika
>>>>>>>>> APIs into pretty much every PL I know of and use
>>>>>>>>> actively today.
>>>>>>>>>
>>>>>>>>> Given that, we probably should update the main website docs to make
>>>>>>>>> this more prominent. The tika server docs on the
>>>>>>>>> wiki are pretty darn good. But they don’t get prime real estate.
>>>>>>>>> Would be wonderful if someone wants to update the
>>>>>>>>> website to make it more prominent.
>>>>>>>>>
>>>>>>>>> The downstream Tika Python lib that I maintain has tons of activity,
>>>>>>>>> is used by more than 350 projects and relies solely
>>>>>>>>> on Tika-Server. My recommendation to the Solr folks (having created
>>>>>>>>> SOLR-7633) from the 2014 DARPA MEMEX days was to
>>>>>>>>> move towards Tika Server based SolrCell dep and that’s the right
>>>>>>>>> way to go IMO.
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>
>>>>>> Spicule Limited is registered in England & Wales. Company Number:
>>>>>> 09954122. Registered office: First Floor, Telecom House, 125-135 Preston
>>>>>> Road, Brighton, England, BN1 6AF. VAT No. 251478891.
>>>>>>
>>>>>> All engagements are subject to Spicule Terms and Conditions of
>>>>>> Business. This email and its contents are intended solely for the
>>>>>> individual to whom it is addressed and may contain information that is
>>>>>> confidential, privileged or otherwise protected from disclosure,
>>>>>> distributing or copying. Any views or opinions presented in this email are
>>>>>> solely those of the author and do not necessarily represent those of
>>>>>> Spicule Limited. The company accepts no liability for any damage caused by
>>>>>> any virus transmitted by this email. If you have received this message in
>>>>>> error, please notify us immediately by reply email before deleting it from
>>>>>> your system. Service of legal notice cannot be effected on Spicule Limited
>>>>>> by email.
>>>>>>
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Eric Pugh <ep...@opensourceconnections.com>.
Great!
> On Feb 5, 2020, at 10:55 PM, David Meikle <da...@meikle.io> wrote:
>
> Hi Eric,
>
> +1 - I think we should drop that and rely on tika-docker instead.
>
> I'm about to push more to it tonight, and then we could include it as a
> sub-module in Tika to do regular development snapshots too.
>
> Cheers,
> Dave
>
> On Wed, 5 Feb 2020 at 15:34, Eric Pugh <epugh@opensourceconnections.com <ma...@opensourceconnections.com>>
> wrote:
>
>> Following this thread, should we deprecate/remove the Tika Docker support
>> that is in Tika-server project?
>>
>> The `mvn dockerfile:build` command now relies on a plugin that is no
>> longer supported according to https://github.com/spotify/dockerfile-maven,
>> and it seems like the Tika-docker project is really the right place for
>> this!
>>
>> I’m thinking that this might help reduce the footprint of things we need
>> to support.
>>
>>
>>
>>
>>
>>
>>
>>
>>> On Jan 9, 2020, at 12:08 AM, Chris Mattmann <ma...@apache.org> wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> Note there is also a USC tika dockers repo where I put the data science
>> stuff too:
>>>
>>>
>>>
>>> http://github.com/USCDataScience/tika-dockers
>>>
>>>
>>>
>>> I’ll continue to push DL and ML Tika stuff there.
>>>
>>> Cheers,
>>>
>>> Chris
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: Dave Meikle <dm...@apache.org>
>>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>> Date: Wednesday, January 8, 2020 at 2:18 PM
>>> To: "<de...@tika.apache.org>" <de...@tika.apache.org>
>>> Subject: Re: [EXTERNAL] Do we have a community supported approach for
>> deploying Tika Server in production?
>>>
>>>
>>>
>>> Hi Eric,
>>>
>>>
>>>
>>> Will take a look. On a related note, I've created a new repos:
>>>
>>> https://github.com/apache/tika-docker
>>>
>>>
>>>
>>> Thinking based on looking at the PRs and Issues on LogicalSpark
>>>
>>> docker-tikaserver, I'll create an updated docker file using what you've
>>>
>>> added here and look to publish builds to docker hub from that.
>>>
>>>
>>>
>>> What do you think?
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Dave
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, 8 Jan 2020 at 03:16, Eric Pugh <ep...@opensourceconnections.com>
>>>
>>> wrote:
>>>
>>>
>>>
>>> Hi all, I’ve gone ahead and added the -spawnChild property as a default
>>>
>>> when running Tika Server as a service. I’d love some eyes on the PR,
>> and
>>>
>>> if this looks good, get it committed.
>>>
>>>
>>>
>>> Feedback welcome!
>>>
>>>
>>>
>>> Eric
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On Dec 17, 2019, at 12:53 PM, Eric Pugh <
>> epugh@opensourceconnections.com>
>>>
>>> wrote:
>>>
>>>>
>>>
>>>> Cool.
>>>
>>>>
>>>
>>>> It’s the auto run that I really need, and the other part that I don’t
>>>
>>> think I’ve tackled properly is the managing of logs…
>>>
>>>>
>>>
>>>> I’m going to check with my project to see if they support Snap packages.
>>>
>>>>
>>>
>>>> Eric
>>>
>>>>
>>>
>>>>
>>>
>>>>> On Dec 16, 2019, at 5:10 PM, Tom Barber <tom@spicule.co.uk <mailto:
>>>
>>> tom@spicule.co.uk>> wrote:
>>>
>>>>>
>>>
>>>>> Just saw this fly by and FYI on Linux systems that support Snap
>>>
>>> packages (Ubuntu/Debian/Arch/Fedora etc) you can `snap install
>> tika-server`
>>>
>>> doesn’t yet auto-run I don’t believe but you can just run
>> `tika-server.run`
>>>
>>> and adding an init script wouldn’t take 5 minutes.
>>>
>>>>>
>>>
>>>>> Tom
>>>
>>>>>
>>>
>>>>> On 16 December 2019 at 18:42:55, Eric Pugh (
>>>
>>> epugh@opensourceconnections.com <mailto:epugh@opensourceconnections.com
>>> )
>>>
>>> wrote:
>>>
>>>>>
>>>
>>>>>> Hi folks!
>>>
>>>>>>
>>>
>>>>>> I’ve got a mostly completed PR for having install scripts for Tika
>>>
>>> Server, and I’m hoping a committer will take a look at the PR, and give
>>>
>>> feedback (and ideally commit in time for 1.24!)
>>>
>>>>>>
>>>
>>>>>> A couple of things:
>>>
>>>>>>
>>>
>>>>>> 1) This was completely influenced by
>>>
>>>
>> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script
>>>
>>> <
>>>
>>>
>> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script
>>>
>>>> <
>>>
>>>
>> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script
>>>
>>> <
>>>
>>>
>> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script
>>>> ,
>>>
>>> in fact I started with the Solr scripts.
>>>
>>>>>>
>>>
>>>>>> 2) I’ve deleted all the Solr specific aspects (I think), however there
>>>
>>> may still be more to delete.
>>>
>>>>>>
>>>
>>>>>> 3) This requires a change to how we release Tika, previously we ship
>>>
>>> tika-app.jar and Tika-eval.jar, and Tika-server.jar, and now, I think, we
>>>
>>> want to add the tika-server-bin.tgz and tika-server-bin.zip binary
>>>
>>> distributions.
>>>
>>>>>>
>>>
>>>>>> I’m happy to start writing accompanying “how to deploy Tika Server”
>>>
>>> docs if this PR looks good! Or, please give input and I’ll make the
>> updates.
>>>
>>>>>>
>>>
>>>>>> Eric
>>>
>>>>>>
>>>
>>>>>>
>>>
>>>>>>> On Dec 12, 2019, at 2:39 PM, Eric Pugh <
>>>
>>> epugh@opensourceconnections.com <mailto:epugh@opensourceconnections.com
>>>>
>>>
>>> wrote:
>>>
>>>>>>>
>>>
>>>>>>> I’ve created this JIRA to track this work:
>>>
>>> https://issues.apache.org/jira/browse/TIKA-3010 <
>>>
>>> https://issues.apache.org/jira/browse/TIKA-3010> <
>>>
>>> https://issues.apache.org/jira/browse/TIKA-3010 <
>>>
>>> https://issues.apache.org/jira/browse/TIKA-3010>>
>>>
>>>>>>>
>>>
>>>>>>> And a WIP progress PR is at https://github.com/apache/tika/pull/305
>>>
>>> <https://github.com/apache/tika/pull/305> <
>>>
>>> https://github.com/apache/tika/pull/305 <
>>>
>>> https://github.com/apache/tika/pull/305>>
>>>
>>>>>>>
>>>
>>>>>>> My thought is to put something together that mimics how we deploy
>>>
>>> Solr, and see how that works. I have a need for an install process that a
>>>
>>> general IT person can follow, who isn’t a Tika expert or a Docker users.
>>>
>>>>>>>
>>>
>>>>>>>
>>>
>>>>>>>
>>>
>>>>>>>
>>>
>>>>>>>> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattmann@apache.org> wrote:
>>>>>>>>
>>>>>>>> Thanks for bringing this conversation up Eric.
>>>>>>>>
>>>>>>>> Historically if you look over the last 5 years, I think what you are asking below has sort of already become the de facto truth. Most people are in fact using Tika server, whether they are individual devs, govvies, commercial folk and the like. Big, small and medium projects. Evidenced by the expansion of Tika APIs into pretty much every PL I know and use actively today.
>>>>>>>>
>>>>>>>> Given that, we probably should update the main website docs to make this more prominent. The tika server docs on the wiki are pretty darn good. But they don’t get prime real estate. Would be wonderful if someone wants to update the website to make it more prominent.
>>>>>>>>
>>>>>>>> The downstream Tika Python lib that I maintain has tons of activity, is used by 350+ projects, and relies solely on Tika-Server. My recommendation to the Solr folks (having created SOLR-7633) from the 2014 DARPA MEMEX days was to move towards a Tika Server based SolrCell dep, and that’s the right way to go IMO.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by David Meikle <da...@meikle.io>.
Hi Eric,
+1 - I think we should drop that and rely on tika-docker instead.
I'm about to push more to it tonight, and then we could include it as a
sub-module in Tika to do regular development snapshots too.
Cheers,
Dave
On Wed, 5 Feb 2020 at 15:34, Eric Pugh <ep...@opensourceconnections.com>
wrote:
> Following this thread, should we deprecate/remove the Tika Docker support
> that is in Tika-server project?
>
> The `mvn dockerfile:build` command now relies on a plugin that is no
> longer supported according to https://github.com/spotify/dockerfile-maven,
> and it seems like the Tika-docker project is really the right place for
> this!
>
> I’m thinking that this might help reduce the footprint of things we need
> to support.
>
> > On Jan 9, 2020, at 12:08 AM, Chris Mattmann <ma...@apache.org> wrote:
> >
> > +1
> >
> > Note there is also a USC tika dockers repo where I put the data science stuff too:
> >
> > http://github.com/USCDataScience/tika-dockers
> >
> > I’ll continue to push DL and ML Tika stuff there.
> >
> > Cheers,
> > Chris
> >
> > From: Dave Meikle <dm...@apache.org>
> > Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> > Date: Wednesday, January 8, 2020 at 2:18 PM
> > To: "<de...@tika.apache.org>" <de...@tika.apache.org>
> > Subject: Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
> >
> > Hi Eric,
> >
> > Will take a look. On a related note, I've created a new repo:
> > https://github.com/apache/tika-docker
> >
> > Based on the PRs and issues on LogicalSpark docker-tikaserver, I'm thinking I'll create an updated Dockerfile using what you've added here and look to publish builds to Docker Hub from that.
> >
> > What do you think?
> >
> > Cheers,
> > Dave
> >
> > On Wed, 8 Jan 2020 at 03:16, Eric Pugh <ep...@opensourceconnections.com> wrote:
> >
> > Hi all, I’ve gone ahead and added the -spawnChild property as a default when running Tika Server as a service. I’d love some eyes on the PR, and if this looks good, get it committed.
> >
> > Feedback welcome!
> >
> > Eric
> >
> >> On Dec 17, 2019, at 12:53 PM, Eric Pugh <epugh@opensourceconnections.com> wrote:
> >>
> >> Cool.
> >>
> >> It’s the auto run that I really need, and the other part that I don’t think I’ve tackled properly is the managing of logs…
> >>
> >> I’m going to check with my project to see if they support Snap packages.
> >>
> >> Eric
> >>
> >>> On Dec 16, 2019, at 5:10 PM, Tom Barber <tom@spicule.co.uk> wrote:
> >>>
> >>> Just saw this fly by and FYI on Linux systems that support Snap packages (Ubuntu/Debian/Arch/Fedora etc) you can `snap install tika-server`. It doesn’t yet auto-run, I don’t believe, but you can just run `tika-server.run`, and adding an init script wouldn’t take 5 minutes.
> >>>
> >>> Tom
> >>>
> >>> On 16 December 2019 at 18:42:55, Eric Pugh (epugh@opensourceconnections.com) wrote:
> >>>
> >>>> Hi folks!
> >>>>
> >>>> I’ve got a mostly completed PR for having install scripts for Tika Server, and I’m hoping a committer will take a look at the PR, and give feedback (and ideally commit in time for 1.24!)
> >>>>
> >>>> A couple of things:
> >>>>
> >>>> 1) This was completely influenced by
> >>>> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script
> >>>> In fact, I started with the Solr scripts.
> >>>>
> >>>> 2) I’ve deleted all the Solr specific aspects (I think), however there may still be more to delete.
> >>>>
> >>>> 3) This requires a change to how we release Tika: previously we shipped tika-app.jar, tika-eval.jar, and tika-server.jar, and now, I think, we want to add the tika-server-bin.tgz and tika-server-bin.zip binary distributions.
> >>>>
> >>>> I’m happy to start writing accompanying “how to deploy Tika Server” docs if this PR looks good! Or, please give input and I’ll make the updates.
> >>>>
> >>>> Eric
> >>>>
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Eric Pugh <ep...@opensourceconnections.com>.
Following this thread, should we deprecate/remove the Tika Docker support that is in Tika-server project?
The `mvn dockerfile:build` command now relies on a plugin that is no longer supported according to https://github.com/spotify/dockerfile-maven, and it seems like the Tika-docker project is really the right place for this!
I’m thinking that this might help reduce the footprint of things we need to support.
> On Jan 9, 2020, at 12:08 AM, Chris Mattmann <ma...@apache.org> wrote:
>
> +1
>
> Note there is also a USC tika dockers repo where I put the data science stuff too:
>
> http://github.com/USCDataScience/tika-dockers
>
> I’ll continue to push DL and ML Tika stuff there.
>
> Cheers,
> Chris
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Chris Mattmann <ma...@apache.org>.
+1
Note there is also a USC tika dockers repo where I put the data science stuff too:
http://github.com/USCDataScience/tika-dockers
I’ll continue to push DL and ML Tika stuff there.
Cheers,
Chris
From: Dave Meikle <dm...@apache.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, January 8, 2020 at 2:18 PM
To: "<de...@tika.apache.org>" <de...@tika.apache.org>
Subject: Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
Hi Eric,
Will take a look. On a related note, I've created a new repo:
https://github.com/apache/tika-docker
Based on the PRs and issues on LogicalSpark's docker-tikaserver, my thinking
is to create an updated Dockerfile using what you've added here, and look to
publish builds to Docker Hub from that.
What do you think?
Cheers,
Dave
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Dave Meikle <dm...@apache.org>.
Hi Eric,
Will take a look. On a related note, I've created a new repo:
https://github.com/apache/tika-docker
Based on the PRs and issues on LogicalSpark's docker-tikaserver, my thinking
is to create an updated Dockerfile using what you've added here, and look to
publish builds to Docker Hub from that.
What do you think?
Cheers,
Dave
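For reference, whichever packaging wins out (service scripts or the tika-docker image), the server is ultimately started with much the same command line. Below is a minimal Python sketch of assembling it. The jar path is hypothetical and the helper name `tika_server_command` is made up for illustration; the only flags used are `-spawnChild` (the default discussed in the quoted message below), `--host` and `--port`, as accepted by the 1.x tika-server jar.

```python
import shlex
from typing import List


def tika_server_command(
    jar_path: str,
    host: str = "localhost",
    port: int = 9998,
    spawn_child: bool = True,
) -> List[str]:
    """Build the argument list for launching Tika Server.

    jar_path is a hypothetical install location; adjust to your layout.
    """
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")

    cmd = ["java", "-jar", jar_path, "--host", host, "--port", str(port)]
    if spawn_child:
        # Run the actual server in a child process that the parent
        # process watches and restarts if it dies.
        cmd.append("-spawnChild")
    return cmd


if __name__ == "__main__":
    # Print a shell-safe version of the command for copy/paste.
    args = tika_server_command("/opt/tika/tika-server.jar")
    print(" ".join(shlex.quote(a) for a in args))
```

An init script or a Dockerfile entrypoint could reuse the same argument list; keeping it as a list rather than one string avoids shell-quoting surprises when the jar path contains spaces.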
On Wed, 8 Jan 2020 at 03:16, Eric Pugh <ep...@opensourceconnections.com>
wrote:
> Hi all, I’ve gone ahead and added the -spawnChild property as a default
> when running Tika Server as a service. I’d love some eyes on the PR, and
> if this looks good, get it committed.
>
> Feedback welcome!
>
> Eric
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Eric Pugh <ep...@opensourceconnections.com>.
Hi all, I’ve gone ahead and added the -spawnChild property as a default when running Tika Server as a service. I’d love some eyes on the PR, and if this looks good, get it committed.
Feedback welcome!
Eric
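To make the change concrete, here is a rough sketch of the kind of command line a service wrapper could assemble with `-spawnChild` switched on by default. It is not the script from the PR. The jar path is a made-up placeholder and the helper name is hypothetical. It assumes the Tika Server 1.x command-line options `--port` and `-spawnChild`.

```python
from typing import List


def build_tika_server_command(jar_path: str, port: int = 9998,
                              spawn_child: bool = True) -> List[str]:
    """Assemble the argument list for launching Tika Server 1.x.

    jar_path is whatever location the installer used (an assumption here).
    """
    cmd = ["java", "-jar", jar_path, "--port", str(port)]
    if spawn_child:
        # As I understand the 1.x option, -spawnChild runs the server in a
        # child process that the parent restarts after a crash or timeout.
        cmd.append("-spawnChild")
    return cmd


if __name__ == "__main__":
    # Placeholder path, for illustration only.
    print(" ".join(build_tika_server_command("/opt/tika/tika-server.jar")))
```

Keeping the flag behind a parameter means someone who doesn't want the child process can still turn it off.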
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Eric Pugh <ep...@opensourceconnections.com>.
Cool.
It’s the auto-run that I really need, and the other part that I don’t think I’ve tackled properly is log management…
I’m going to check with my project to see if they support Snap packages.
Eric
> On Dec 16, 2019, at 5:10 PM, Tom Barber <to...@spicule.co.uk> wrote:
>
> Just saw this fly by. FYI, on Linux systems that support Snap packages (Ubuntu/Debian/Arch/Fedora etc.) you can `snap install tika-server`. I don’t believe it auto-runs yet, but you can just run `tika-server.run`, and adding an init script wouldn’t take 5 minutes.
>
> Tom
>
> On 16 December 2019 at 18:42:55, Eric Pugh (epugh@opensourceconnections.com) wrote:
>
>> Hi folks!
>>
>> I’ve got a mostly completed PR for having install scripts for Tika Server, and I’m hoping a committer will take a look at the PR, and give feedback (and ideally commit in time for 1.24!)
>>
>> A couple of things:
>>
>> 1) This was completely influenced by https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script (in fact, I started with the Solr scripts).
>>
>> 2) I’ve deleted all the Solr-specific aspects (I think); however, there may still be more to delete.
>>
>> 3) This requires a change to how we release Tika. Previously we shipped tika-app.jar, tika-eval.jar, and tika-server.jar; now, I think, we also want to add the tika-server-bin.tgz and tika-server-bin.zip binary distributions.
>>
>> I’m happy to start writing accompanying “how to deploy Tika Server” docs if this PR looks good! Or, please give input and I’ll make the updates.
>>
>> Eric
>>
>>
>> > On Dec 12, 2019, at 2:39 PM, Eric Pugh <epugh@opensourceconnections.com> wrote:
>> >
>> > I’ve created this JIRA to track this work: https://issues.apache.org/jira/browse/TIKA-3010
>> >
>> > And a WIP PR is at https://github.com/apache/tika/pull/305
>> >
>> > My thought is to put something together that mimics how we deploy Solr, and see how that works. I need an install process that a general IT person, who isn’t a Tika expert or a Docker user, can follow.
>> >
>> >> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattmann@apache.org> wrote:
>> >>
>> >> Thanks for bringing this conversation up, Eric.
>> >>
>> >> Historically, if you look over the last 5 years, I think what you are asking below has sort of already become the de facto truth. Most people are in fact using Tika server, whether they are individual devs, govvies, commercial folk and the like. Big, small and medium projects. Evidenced by the expansion of Tika APIs into pretty much every PL I know of and use actively today.
>> >>
>> >> Given that, we probably should update the main website docs to make this more prominent. The tika server docs on the wiki are pretty darn good. But they don’t get prime real estate. Would be wonderful if someone wants to update the website to make it more prominent.
>> >>
>> >> The downstream Tika Python lib that I maintain has tons of activity, is used by 350+ projects, and relies solely on Tika-Server. My recommendation to the Solr folks (having created SOLR-7633) from the 2014 DARPA MEMEX days was to move towards a Tika Server based SolrCell dep, and that’s the right way to go IMO.
>> >>
>> >> Chris
>> >>
>> >> From: Eric Pugh <epugh@opensourceconnections.com>
>> >> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>> >> Date: Wednesday, December 4, 2019 at 12:24 PM
>> >> To: "tika-dev@apache.org" <tika-dev@apache.org>
>> >> Subject: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
>> >>
>> >>
>> >>
>> >> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
>> >>
>> >>
>> >>
>> >> Over in Solr land there has been renewed discussion about streamlining what Solr is....
>> >>
>> >>
>> >>
>> >> In regards to rich content extraction and the Tika project, it seems like the two ideas that continue to preserve the existing behavior are:
>> >>
>> >>
>> >>
>> >> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr. This slims down the standard Solr download, and *might* make it easier to update the version of Tika + dependent jars used?
>> >>
>> >>
>> >>
>> >> 2) The second approach is to instead require Tika-Server to be running (https://issues.apache.org/jira/browse/SOLR-7633 <https://issues.apache.org/jira/browse/SOLR-7633><https://issues.apache.org/jira/browse/SOLR-7633 <https://issues.apache.org/jira/browse/SOLR-7633>>) and just have Solr delegate the call to Tika-Server.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> I was thinking about why I like option 1 better than 2, and I think it boils down to how mature the IT organization I am working with is. Some IT organizations have large dev-ops teams, and are working at major scale, and managing a fleet of Tika-Server on Kubernetes with Load Balancer dynamically scaling up and down is simple and second nature! However, many organizations aren’t like that.
>> >>
>> >>
>> >>
>> >> So I guess what I’m asking is do we have a reasonable supported approach for deploying Tika Server for non-tika savvy organizations? I’m thinking about Solr, and specifically the fact that Solr has a well defined set of Service Installation scripts. When I follow the directions in https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production <https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production><https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production <https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production>> I can feel confident that when the server is rebooted, then Solr will come back up! Plus there is log rotation and all the rest.
>> >>
>> >>
>> >>
>> >> In contrast, when I look at Tika website, specifically https://tika.apache.org/1.22/gettingstarted.htm <https://tika.apache.org/1.22/gettingstarted.htm><https://tika.apache.org/1.22/gettingstarted.htm <https://tika.apache.org/1.22/gettingstarted.htm>> pagel, the message is to run Tika as a command line application, or embedded in your application.
>> >>
>> >>
>> >>
>> >> I’m wondering if Tika-Server needs to be made more prominent, and treated as the “primary method of interacting with Tika”? Do we need as a community to focus more on Tika-Server? In our getting started documentation, in our usage documentation, and in our examples?
>> >>
>> >>
>> >>
>> >> Do we need to create the equivalent of the Service Installation scripts for Tika-Server?
>> >>
>> >>
>> >>
>> >> Wanted to stoke the discussion!
>> >>
>> >>
>> >>
>> >> Eric
>> >>
>> >>
>> >>
>> >> _______________________
>> >>
>> >> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/><http://www.opensourceconnections.com/ <http://www.opensourceconnections.com/>><http://www.opensourceconnections.com/ <http://www.opensourceconnections.com/> <http://www.opensourceconnections.com/ <http://www.opensourceconnections.com/>>> | My Free/Busy <http://tinyurl.com/eric-cal <http://tinyurl.com/eric-cal> <http://tinyurl.com/eric-cal <http://tinyurl.com/eric-cal>>>
>> >>
>> >> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>>>
>> >>
>> >> This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
>> >
>> > _______________________
>> > Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/><http://www.opensourceconnections.com/ <http://www.opensourceconnections.com/>> | My Free/Busy <http://tinyurl.com/eric-cal <http://tinyurl.com/eric-cal>>
>> > Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>>
>> > This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
>> >
>>
>> _______________________
>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/><http://www.opensourceconnections.com/ <http://www.opensourceconnections.com/>> | My Free/Busy <http://tinyurl.com/eric-cal <http://tinyurl.com/eric-cal>>
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>>
>> This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
>>
>
> Spicule Limited is registered in England & Wales. Company Number: 09954122. Registered office: First Floor, Telecom House, 125-135 Preston Road, Brighton, England, BN1 6AF. VAT No. 251478891.
>
>
>
> All engagements are subject to Spicule Terms and Conditions of Business. This email and its contents are intended solely for the individual to whom it is addressed and may contain information that is confidential, privileged or otherwise protected from disclosure, distributing or copying. Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of Spicule Limited. The company accepts no liability for any damage caused by any virus transmitted by this email. If you have received this message in error, please notify us immediately by reply email before deleting it from your system. Service of legal notice cannot be effected on Spicule Limited by email.
>
_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Tom Barber <to...@spicule.co.uk>.
Just saw this fly by. FYI: on Linux systems that support Snap packages
(Ubuntu/Debian/Arch/Fedora etc.) you can `snap install tika-server`. It
doesn’t auto-run yet, I don’t believe, but you can just run
`tika-server.run`, and adding an init script wouldn’t take 5 minutes.
Tom
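For illustration only, here is a minimal sketch of what such an init script could look like on a systemd-based distribution, written as a small Python helper that renders the unit file. The `/snap/bin/tika-server.run` path follows the usual snap layout but has not been checked against this particular snap, and the description text and restart policy are assumptions rather than anything shipped by Tika.

```python
from pathlib import Path

# Assumption: snaps normally expose their commands under /snap/bin.
DEFAULT_COMMAND = "/snap/bin/tika-server.run"

# Hypothetical unit contents; the restart policy is an illustrative choice.
UNIT_TEMPLATE = """\
[Unit]
Description=Apache Tika Server (snap)
After=network.target

[Service]
ExecStart={command}
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""


def render_unit(command: str = DEFAULT_COMMAND) -> str:
    """Return systemd unit text that starts Tika Server with the given command."""
    return UNIT_TEMPLATE.format(command=command)


def write_unit(path: Path, command: str = DEFAULT_COMMAND) -> None:
    """Write the rendered unit file to the given path."""
    path.write_text(render_unit(command), encoding="utf-8")
```

Writing the result to something like `/etc/systemd/system/tika-server.service` (a hypothetical name) and running `systemctl enable --now tika-server` should then bring the server back after a reboot.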
--
Spicule Limited is registered in England & Wales. Company Number:
09954122. Registered office: First Floor, Telecom House, 125-135 Preston
Road, Brighton, England, BN1 6AF. VAT No. 251478891.
All engagements
are subject to Spicule Terms and Conditions of Business. This email and its
contents are intended solely for the individual to whom it is addressed and
may contain information that is confidential, privileged or otherwise
protected from disclosure, distributing or copying. Any views or opinions
presented in this email are solely those of the author and do not
necessarily represent those of Spicule Limited. The company accepts no
liability for any damage caused by any virus transmitted by this email. If
you have received this message in error, please notify us immediately by
reply email before deleting it from your system. Service of legal notice
cannot be effected on Spicule Limited by email.
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Eric Pugh <ep...@opensourceconnections.com>.
Hi folks!
I’ve got a mostly completed PR for having install scripts for Tika Server, and I’m hoping a committer will take a look at the PR, and give feedback (and ideally commit in time for 1.24!)
A couple of things:
1) This was completely influenced by the Solr service installation script (https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script); in fact, I started with the Solr scripts.
2) I’ve deleted all the Solr-specific aspects (I think), but there may still be more to delete.
3) This requires a change to how we release Tika: so far we have shipped tika-app.jar, tika-eval.jar, and tika-server.jar, and now, I think, we also want to add the tika-server-bin.tgz and tika-server-bin.zip binary distributions.
I’m happy to start writing accompanying “how to deploy Tika Server” docs if this PR looks good! Or, please give input and I’ll make the updates.
Eric
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Eric Pugh <ep...@opensourceconnections.com>.
I’ve created this JIRA to track this work: https://issues.apache.org/jira/browse/TIKA-3010
And a WIP PR is at https://github.com/apache/tika/pull/305
My thought is to put something together that mimics how we deploy Solr, and see how that works. I need an install process that a general IT person can follow, someone who isn’t a Tika expert or a Docker user.
Re: [EXTERNAL] Do we have a community supported approach for
deploying Tika Server in production?
Posted by Chris Mattmann <ma...@apache.org>.
Thanks for bringing this conversation up, Eric.
Historically, if you look over the last 5 years, I think what you are asking below has sort of already become the de facto truth. Most people are in fact using Tika Server, whether they are individual devs, govvies, commercial folk, and the like: big, small, and medium projects. This is evidenced by the expansion of Tika APIs into pretty much every PL I know of and use actively today.
Given that, we probably should update the main website docs to make this more prominent. The Tika Server docs on the wiki are pretty darn good, but they don’t get prime real estate. It would be wonderful if someone wants to update the website to make them more prominent.
The downstream Tika Python lib that I maintain has tons of activity, is used by 350+ projects, and relies solely on Tika-Server. My recommendation to the Solr folks (having created SOLR-7633), going back to the 2014 DARPA MEMEX days, was to move towards a Tika Server-based SolrCell dependency, and that’s the right way to go IMO.
Chris
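For readers unfamiliar with the delegation model discussed above, here is a minimal sketch of a client handing a document to a running Tika Server over HTTP. It assumes a server is already listening on localhost at port 9998 and uses the server's PUT /tika endpoint to request plain text. The function names and the TIKA_SERVER_URL constant are hypothetical, chosen for illustration; this is not how the Tika Python lib or SolrCell is implemented.

```python
import urllib.request

# Assumption: a Tika Server instance is reachable at this address.
TIKA_SERVER_URL = "http://localhost:9998"


def build_extract_request(data: bytes, base_url: str = TIKA_SERVER_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika Server for plain text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data: bytes, base_url: str = TIKA_SERVER_URL, timeout: float = 60.0) -> str:
    """Send the document bytes to Tika Server and return the extracted text."""
    request = build_extract_request(data, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, a call such as extract_text(open("report.pdf", "rb").read()) would return the document's text. The point of the sketch is that the caller carries no parser dependencies of its own, which is the appeal of the Tika Server based approach for Solr.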
From: Eric Pugh <ep...@opensourceconnections.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, December 4, 2019 at 12:24 PM
To: "tika-dev@apache.org" <ti...@apache.org>
Subject: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
Over in Solr land there has been renewed discussion about streamlining what Solr is....
In regards to rich content extraction and the Tika project, it seems like the two ideas that continue to preserve the existing behavior are:
1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr. This slims down the standard Solr download, and *might* make it easier to update the version of Tika + dependent jars used?
2) The second approach is to instead require Tika-Server to be running (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate the call to Tika-Server.
I was thinking about why I like option 1 better than 2, and I think it boils down to how mature the IT organization I am working with is. Some IT organizations have large dev-ops teams, and are working at major scale, and managing a fleet of Tika-Server on Kubernetes with Load Balancer dynamically scaling up and down is simple and second nature! However, many organizations aren’t like that.
So I guess what I’m asking is do we have a reasonable supported approach for deploying Tika Server for non-tika savvy organizations? I’m thinking about Solr, and specifically the fact that Solr has a well defined set of Service Installation scripts. When I follow the directions in https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production I can feel confident that when the server is rebooted, then Solr will come back up! Plus there is log rotation and all the rest.
In contrast, when I look at the Tika website, specifically the https://tika.apache.org/1.22/gettingstarted.html page, the message is to run Tika as a command line application, or embedded in your application.
I’m wondering if Tika-Server needs to be made more prominent, and treated as the “primary method of interacting with Tika”? Do we need as a community to focus more on Tika-Server? In our getting started documentation, in our usage documentation, and in our examples?
Do we need to create the equivalent of the Service Installation scripts for Tika-Server?
Wanted to stoke the discussion!
Eric