Posted to dev@tika.apache.org by Chris Mattmann <ma...@apache.org> on 2019/11/21 00:02:23 UTC

Re: [EXTERNAL] Re: Docker image along with 1.23?

Nick, TBH, I don’t get it. If we ship the “Dockerfile”, we are simply shipping a text file:
code, under a license. If we create a “docker image” and then publish it to the ASF
hub, then I agree with you.

My suggestion, and my interpretation of Tim’s, is to ship a standard “Dockerfile”. Do you
agree with this? It should be air covered (as former VP, Legal, at least it would have been
with me).

Cheers,

Chris

From: Nick Burch <ap...@gagravarr.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, November 20, 2019 at 3:57 PM
To: "Allison, Timothy B (US 1760-Affiliate)" <ti...@jpl.nasa.gov>
Cc: "<de...@tika.apache.org>" <de...@tika.apache.org>
Subject: [EXTERNAL] Re: Docker image along with 1.23?

 

On Wed, 20 Nov 2019, Tim Allison wrote:
> Eric Pugh recently asked on another channel if we had any plans to
> release an official docker image for 1.23.

Depending on what we put in the container, we do need to be a little
careful. There are "platform dependencies" under non-compatible licenses
that we can optionally use if people have installed them, which we
ourselves can't directly ship under ASF rules. (Tesseract is fine as
that's Apache Licensed; Java itself is trickier, see the NetBeans
discussions on legal-discuss@ and LEGAL jira.)

Shipping an official docker container with the Tika Server on it seems to me
to be a helpful step for users, but we just need to make sure we're
following ASF policies. (The Apache Software Foundation mission is to
"provide software for the public good", but source code is the main focus
for the mission; binaries are trickier!)

Nick

 


Re: [EXTERNAL] Docker image along with 1.23?

Posted by Tim Allison <ta...@apache.org>.
K.  Sounds like an example Dockerfile will meet your needs, Eric?

Users can currently build their own images with the Dockerfile in
tika-server, and there's logical-spark.

As noted, there are complexities with distributing an image.

Between those two options, folks should basically be ok.  Right?

I might want to add an advanced Dockerfile example on our wiki (or
perhaps in logical-spark ???) that:
1) runs tika-server in spawn-child mode
2) returns stack-traces
3) includes the "provided" xerial sqlite jar
4) includes non-ASL-2.0-compatible dependencies for image processing in PDFs

Anything else?



On Thu, Nov 21, 2019 at 7:10 AM Eric Pugh <ep...@opensourceconnections.com>
wrote:

> That makes sense.   Having a robust Dockerfile, even if it isn’t
> published, is a great way of modeling best practices in running Tika in
> server mode.

Re: [EXTERNAL] Docker image along with 1.23?

Posted by Eric Pugh <ep...@opensourceconnections.com>.
That makes sense.   Having a robust Dockerfile, even if it isn’t published, is a great way of modeling best practices in running Tika in server mode.



> On Nov 21, 2019, at 3:26 AM, Nick Burch <ap...@gagravarr.org> wrote:
> 
> On Thu, 21 Nov 2019, Oleg Tikhonov wrote:
>> My question is more pragmatic.
>> What do we put inside the Dockerfile, and which image will it be based on (say
>> Ubuntu)?
>> What will the entrypoint be? Tika Server? Should we "install"
>> Tesseract? Anything more?
> 
> If we want to be trendy, then Sergey Beryozkin did some cool stuff with Quarkus and a GraalVM native image of Tika, video online at
> https://aceu19.apachecon.com/session/apache-tika-goes-native-graalvm-and-quarkus
> 
> I'd possibly suggest two Dockerfiles (but not published images!), both based on a fairly thin common Java base image (so probably Ubuntu rather than Alpine). One with just Tika Server + Tesseract + English Tesseract data, one with all the optional Tika dependencies (SQL native libraries etc.) and Tesseract and all the available Tesseract languages.
> 
> Some other projects are currently leading the debate on ASF binary releases that bundle the JVM, I'd suggest we wait for that to resolve before we think about trying to publish pre-built images ourselves. Linking to images from external organisations we trust should be fine though, eg similar to http://httpd.apache.org/docs/current/platform/windows.html#down
> 
> Nick

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.


Re: [EXTERNAL] Docker image along with 1.23?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 21 Nov 2019, Oleg Tikhonov wrote:
> My question is more pragmatic.
> What do we put inside the Dockerfile, and which image will it be based on (say
> Ubuntu)?
> What will the entrypoint be? Tika Server? Should we "install"
> Tesseract? Anything more?

If we want to be trendy, then Sergey Beryozkin did some cool stuff with
Quarkus and a GraalVM native image of Tika, video online at
https://aceu19.apachecon.com/session/apache-tika-goes-native-graalvm-and-quarkus

I'd possibly suggest two Dockerfiles (but not published images!), both
based on a fairly thin common Java base image (so probably Ubuntu rather
than Alpine). One with just Tika Server + Tesseract + English Tesseract
data, one with all the optional Tika dependencies (SQL native libraries
etc.) and Tesseract and all the available Tesseract languages.

Some other projects are currently leading the debate on ASF binary 
releases that bundle the JVM, I'd suggest we wait for that to resolve 
before we think about trying to publish pre-built images ourselves. 
Linking to images from external organisations we trust should be fine 
though, eg similar to 
http://httpd.apache.org/docs/current/platform/windows.html#down

Nick

Re: [EXTERNAL] Docker image along with 1.23?

Posted by Oleg Tikhonov <ol...@apache.org>.
My question is more pragmatic.
What do we put inside the Dockerfile, and which image will it be based on (say
Ubuntu)?
What will the entrypoint be? Tika Server? Should we "install"
Tesseract? Anything more?

Thanks,
Oleg

On Thu, Nov 21, 2019 at 4:46 AM Chris Mattmann <ma...@apache.org> wrote:

> Yeah producing the actual image is tricky and my recommendation is for Tika to
> stay out of the business of that. Leave it to LogicalSpark or others to do this. It’s
> tricky with licenses and I doubt ASF will ever develop an optimal solution to this
> due to the nature of its core mission as Nick stated.

Re: [EXTERNAL] Docker image along with 1.23?

Posted by Chris Mattmann <ma...@apache.org>.
Yeah producing the actual image is tricky and my recommendation is for Tika to 
stay out of the business of that. Leave it to LogicalSpark or others to do this. It’s 
tricky with licenses and I doubt ASF will ever develop an optimal solution to this 
due to the nature of its core mission as Nick stated.

From: Eric Pugh <ep...@opensourceconnections.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, November 20, 2019 at 6:02 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Cc: "Allison, Timothy B (US 1760-Affiliate)" <ti...@jpl.nasa.gov>
Subject: Re: [EXTERNAL] Docker image along with 1.23?

 

I was thinking more of producing the actual image, so that others don’t have to go through the pain of compiling an image.   Having the Dockerfile made available as well does give a nice recipe for modifying the “official” image.   I recently tested Tesseract 3 with the latest Tika, and I did it by tweaking the existing Dockerfile that LogicalSpark has published.

 

I don’t know how other projects at ASF handle the image publishing.

 

 

 

 


Re: [EXTERNAL] Docker image along with 1.23?

Posted by Eric Pugh <ep...@opensourceconnections.com>.
I was thinking more of producing the actual image, so that others don’t have to go through the pain of compiling an image.   Having the Dockerfile made available as well does give a nice recipe for modifying the “official” image.   I recently tested Tesseract 3 with the latest Tika, and I did it by tweaking the existing Dockerfile that LogicalSpark has published.

I don’t know how other projects at ASF handle the image publishing.




> On Nov 20, 2019, at 7:02 PM, Chris Mattmann <ma...@apache.org> wrote:
> 
> Nick, TBH, I don’t get it. If we ship the “Dockerfile”, we are simply shipping a text file:
> code, under a license. If we create a “docker image” and then publish it to the ASF
> hub, then I agree with you.
>
> My suggestion, and my interpretation of Tim’s, is to ship a standard “Dockerfile”. Do you
> agree with this? It should be air covered (as former VP, Legal, at least it would have been
> with me).
>
> Cheers,
>
> Chris
