Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2017/05/31 19:33:00 UTC

experiences with Tika in Docker

Dave Meikle, Tom and All,

    How many of us are using Tika in Docker? If you are, how exactly are you using it? Single instance, swarm, Kubernetes, something else? People fear the I/O hit with tika-server...what are your experiences?
I really like the ability to limit the number of CPUs in the Docker container. If a single doc causes multithreaded GC to go nuts, that won't kill an entire machine. This also cleanly limits the risk from XXE or arbitrary code execution, right?

If this is one of the ways of the future for big data, we might want to look into hardening tika-server (OOMs, timeouts).  What do you all think?

        Cheers,

                Tim

Timothy B. Allison, Ph.D.
Principal Artificial Intelligence Engineer
Group Lead
K83E/Human Language Technology
The MITRE Corporation
7515 Colshire Drive, McLean, VA  22102
703-983-2473 (phone); 703-983-1379 (fax)
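
A minimal sketch of the CPU cap described in the message above, assuming Docker's `--cpus` option. The image name `example/tika-server` and the helper `docker_run_command` are hypothetical; 9998 is tika-server's default port. The code only builds the command line and does not start anything.

```python
import shlex


def docker_run_command(image, cpus=1.0, port=9998):
    """Build a `docker run` argv that caps the container's CPU usage."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return [
        "docker", "run", "--rm",
        "--cpus", str(cpus),      # upper bound on CPU time the container may use
        "-p", f"{port}:{port}",   # publish the tika-server port to the host
        image,                    # hypothetical image name, see the note above
    ]


cmd = docker_run_command("example/tika-server", cpus=2)
print(shlex.join(cmd))
```

With a cap like this, a document that sends garbage collection into overdrive can saturate at most the CPUs granted to the container rather than every core on the host.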


Re: experiences with Tika in Docker

Posted by Tom Barber <ma...@apache.org>.
Yeah, the encapsulation of the service is pretty darn useful. You can also
start thinking about load balancing and autoscaling for high-volume stuff:
spin up many identical containers, distribute the workload, and shut them all
down again to free up resources.

I also have a Snappy package for Tika that I can commit up to you guys if
you're interested. It will allow you to do `snap install tika` on most
mainstream Linux distros, like you would a deb or rpm, but the benefit is
that you also get automated updates and rollback along with (and more
usefully) software isolation and encapsulation.

Tom
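
The "spin up many identical containers and distribute the workload" idea above can be sketched as a simple round-robin plan. This is only an illustration under stated assumptions: the endpoint URLs and the helper `assign_round_robin` are hypothetical, and a real deployment would more likely put a load balancer in front of the containers.

```python
from itertools import cycle


def assign_round_robin(documents, endpoints):
    """Spread documents across identical tika-server endpoints in turn."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    assignment = {endpoint: [] for endpoint in endpoints}
    # cycle() repeats the endpoints forever; zip() stops when documents run out.
    for document, endpoint in zip(documents, cycle(endpoints)):
        assignment[endpoint].append(document)
    return assignment


# Hypothetical container hostnames; 9998 is tika-server's default port.
endpoints = [f"http://tika-{i}:9998/tika" for i in range(3)]
plan = assign_round_robin(["a.pdf", "b.docx", "c.pdf", "d.png"], endpoints)
for endpoint, documents in plan.items():
    print(endpoint, documents)
```

Because the containers are identical, any of them can take any document, so the plan needs no knowledge of file types or sizes.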


Re: experiences with Tika in Docker

Posted by Oleg Tikhonov <ol...@apache.org>.
Guys, I can help with Tika dockerization. Let's just design/plan what we
are going to do.


Re: experiences with Tika in Docker

Posted by Eric Pugh <ep...@opensourceconnections.com>.
As the Tika project starts embracing more non-Java tools (I’m thinking of Tesseract, for example), dockerizing your Tika setup becomes more and more valuable.

For example, I run the tests for my application on my local Mac, as well as on CircleCI. I have a dockerized Tika service that does the OCR stuff, and I know it works the same on both. It’s less exciting if I’m in an “all Java” world.

 


_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.


RE: experiences with Tika in Docker

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Thejan!


Re: experiences with Tika in Docker

Posted by Thejan Wijesinghe <th...@gmail.com>.
Hi Tim,

I've used tika-server in Docker, but as a single instance only. Yes, its
ability to limit a container's resources with respect to memory and CPU on the
host machine is great. It gives us so much flexibility: we could enforce
hard/soft memory limits, and we could even limit the container's share of the
host machine's CPU cycles. Yes, it also limits the risks of arbitrary code
execution and XXE vulnerabilities. I already asked Prof. Chris Mattmann about
officially moving to Docker Hub. He said I need to send an email to Apache
Infra asking about this. Unfortunately, I still haven't found the time to
write that email.

We already have multiple Dockerfiles in Tika: the Dockerfile in tika-server,
InceptionRestDockerfile, InceptionVideoRestDockerfile, and
Im2txtRestDockerfile (PR #180, for image captioning).

Part of my GSoC project is to unify the existing REST services, such as
object recognition and image captioning. My idea is to bring all of those REST
services together so that the user can start/terminate any REST service and
see its statistics through a web-based GUI. I'm expecting to use a combination
of nginx (as the reverse proxy server) and Docker to make it work. So
obviously we will see Docker much more often in Tika.

+1 to your idea of looking into hardening tika-server with the help
of Docker.

best,
ThejanW
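
As a rough sketch of the hard/soft memory limits mentioned above, assuming Docker's `--memory` (hard limit) and `--memory-reservation` (soft limit) options. The helper `memory_limit_flags`, the sizes, and the image name `example/tika-server` are hypothetical, and the code only assembles the command line.

```python
def memory_limit_flags(hard_limit, soft_limit=None):
    """Return docker run flags for a hard and an optional soft memory limit."""
    flags = ["--memory", hard_limit]  # hard cap on the container's memory
    if soft_limit is not None:
        # Soft limit: a lower target Docker applies under memory pressure.
        flags += ["--memory-reservation", soft_limit]
    return flags


cmd = [
    "docker", "run", "--rm",
    *memory_limit_flags("2g", soft_limit="1g"),
    "example/tika-server",  # hypothetical image name
]
print(" ".join(cmd))
```

The soft limit should be set below the hard limit; the values here are placeholders, not recommendations for tika-server.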
