Posted to user@tika.apache.org by Robert Raines <rr...@nisos.com> on 2020/11/23 20:36:38 UTC

Why does Tika offer a client-server option?

Hi,

I am using Tika to extract text from Word Docs and PDFs locally. It's
great. Thank you Apache and Tika developers!

Could someone help me understand why Tika offers a client-server option
instead of just a code library? I am sure there was/is a good reason, so I
am curious if anyone knows or if there are some resources that explain the
history of how/why Tika also has its API architecture.

Thanks so much,
Robert

Re: Why does Tika offer a client-server option?

Posted by Tim Allison <ta...@apache.org>.
Hi Robert,

  Thank you for the note. If you'd like, you can call Tika programmatically
from Java.  Some examples are available here:
https://tika.apache.org/1.24.1/examples.html

  One of the best reasons to use Tika via tika-server is that you isolate
potentially catastrophic problems in a separate JVM.  If you aren't a heavy
user, you're unlikely to run into problems that cause timeouts or
out-of-memory errors, but if you run Tika against
thousands/millions/billions of untrusted files, you'll likely hit one of
these problems.
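
  To make the isolation concrete, here is a minimal sketch of a client that
sends a document to a separately running tika-server and reads back plain
text. It assumes the server listens on its default port 9998 and accepts a
PUT to /tika; the class and method names (TikaServerClient, extractText) are
made up for illustration and are not part of Tika.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

public class TikaServerClient {
    // One shared client; HTTP/1.1 is enough for a simple request/response.
    private static final HttpClient HTTP = HttpClient.newBuilder()
            .version(HttpClient.Version.HTTP_1_1)
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    /**
     * Sends the raw document bytes to {serverBase}/tika and returns the
     * extracted text. The parsing happens in the server's JVM, so a crash
     * or out-of-memory error there surfaces here only as an IOException
     * or a timeout.
     */
    public static String extractText(URI serverBase, byte[] document, Duration timeout)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .timeout(timeout) // give up if the server hangs on this file
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
        HttpResponse<String> response =
                HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: TikaServerClient <file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Path.of(args[0]));
        // Assumed default address of a locally started tika-server.
        URI server = URI.create("http://localhost:9998");
        System.out.println(extractText(server, bytes, Duration.ofSeconds(60)));
    }
}
```

  If the server dies on a bad file, the caller only sees a failed request
and can retry or skip that file once the server has been restarted.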

  What are you trying to achieve with the library vs tika-server?

    Cheers,

         Tim


Re: Why does Tika offer a client-server option?

Posted by Adam Rauch <ad...@labkey.com>.
Another option is to load and invoke Tika in its own classloader to keep 
its jars isolated from the rest of the application. We did this for a 
while, until we switched to Gradle and implemented the "careful 
exclusion" approach that Ken mentioned. The downside was the need to use 
reflection to invoke Tika (AutoDetectParser) and marshal properties in 
and out of Metadata. But it worked fine. We used this child-first 
classloader implementation: 
https://articles.qos.ch/delegation/src/java/ch/qos/ChildFirstClassLoader.java
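
For readers who want to see the shape of this approach, below is a rough 
sketch, not the code we actually ran. It uses a simplified child-first 
loader written for this example (not the qos.ch class linked above) and, 
for brevity, calls the org.apache.tika.Tika facade's parseToString(File) 
by reflection instead of AutoDetectParser. The class name IsolatedTika 
and the jar path argument are hypothetical.

```java
import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class IsolatedTika {

    /** Looks in its own jars first and only then asks the parent loader. */
    public static class ChildFirstClassLoader extends URLClassLoader {
        public ChildFirstClassLoader(URL[] jars, ClassLoader parent) {
            super(jars, parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                // Never try to redefine core JDK classes.
                if (c == null && !name.startsWith("java.")) {
                    try {
                        c = findClass(name); // child first
                    } catch (ClassNotFoundException ignored) {
                        // Not in our jars; fall through to the parent.
                    }
                }
                if (c == null) {
                    c = super.loadClass(name, false); // normal parent delegation
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }
    }

    /** Invokes Tika purely by reflection so the caller never links against it. */
    public static String parseToString(ClassLoader loader, File file) throws Exception {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        // Assumption: parsers are discovered via the context class loader,
        // so point it at the isolated loader for the duration of the call.
        current.setContextClassLoader(loader);
        try {
            Class<?> tikaClass = loader.loadClass("org.apache.tika.Tika");
            Object tika = tikaClass.getDeclaredConstructor().newInstance();
            Method parse = tikaClass.getMethod("parseToString", File.class);
            return (String) parse.invoke(tika, file);
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: IsolatedTika <tika-app.jar> <file>");
            return;
        }
        URL[] jars = { new File(args[0]).toURI().toURL() };
        try (ChildFirstClassLoader loader =
                new ChildFirstClassLoader(jars, IsolatedTika.class.getClassLoader())) {
            System.out.println(parseToString(loader, new File(args[1])));
        }
    }
}
```

Only plain JDK types (File, String) cross the boundary here, which is 
why anything richer, such as Metadata, has to be marshalled by hand.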

Adam

On 11/25/2020 6:57 AM, Ken Krugler wrote:
> When we used Tika as a library with Hadoop map-reduce workflows, we 
> had to run it in a separate thread with a timeout, and leave the 
> thread as a zombie if/when it hung.
>
> As far as jar hell (a very real problem), you can either do careful 
> exclusions in your dependency specification (painful, and fragile) to 
> avoid pulling in the world and creating incompatibilities in jar 
> versions, or you could create a shaded Tika jar.
>
> — Ken
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com <http://www.scaleunlimited.com>
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>

Re: Why does Tika offer a client-server option?

Posted by Ken Krugler <kk...@transpac.com>.
When we used Tika as a library with Hadoop map-reduce workflows, we had to run it in a separate thread with a timeout, and leave the thread as a zombie if/when it hung.
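
A minimal sketch of that pattern, using only java.util.concurrent, might look like the following. The class and method names (TimedParse, runWithTimeout) are invented for this example, and the Callable stands in for whatever Tika call you actually make; it is an illustration of the idea, not the code from our workflows.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParse {
    // Daemon threads, so an abandoned ("zombie") worker cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis.
     * Returns an empty Optional on timeout; the worker is interrupted on a
     * best-effort basis and otherwise simply left behind.
     */
    public static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis)
            throws ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // A parser stuck in a tight loop may ignore the interrupt entirely.
            future.cancel(true);
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for a real parse call: one that finishes and one that hangs.
        Optional<String> fast = runWithTimeout(() -> "extracted text", 1000);
        Optional<String> hung = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never returned";
        }, 100);
        System.out.println("fast finished: " + fast.isPresent());
        System.out.println("hung finished: " + hung.isPresent());
    }
}
```

Note that a thread which never responds to the interrupt keeps its memory and CPU until the process exits, which is one more argument for isolating the parser in a separate JVM.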

As for jar hell (a very real problem), you can either do careful exclusions in your dependency specification (painful and fragile) to avoid pulling in the world and creating incompatibilities in jar versions, or create a shaded Tika jar.

— Ken


--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr


Re: Why does Tika offer a client-server option?

Posted by Tucker B <ba...@gmail.com>.
Not totally on topic, but I think related to this thread. I'm currently
exploring using Tika as a library in Apache Spark. This approach suffers
the same problems as using Tika as a library mentioned above. Has anyone used
Tika as a library in a Spark job? Or would it still make sense to use
something external like tika-server? That seems like it might be counter to
the point of using Spark in the first place.


Re: Why does Tika offer a client-server option?

Posted by Slava G <sl...@gmail.com>.
We have been using Tika as a Java library for a few years now, parsing
millions of different files each day. We're now switching to tika-server
because bugs in various Tika components (dependencies) caused issues
such as the JVM exiting, memory problems, and so on. Also, Tika and its
dependencies bring in a lot of other dependencies, so the switch should
simplify maintenance and reduce JAR hell.

So, this is our road from Tika as a Java library to Tika as a server 😀

Thanks


Re: Why does Tika offer a client-server option?

Posted by Ralph Soika <ra...@imixs.com>.
Hi Robert,

In the sense of a microservice architecture, it makes absolute sense to 
use Tika as a server/microservice component. As Tim Allison explained, 
this helps you separate your business requirements into isolated 
components (each running in its own JVM).

If you don't need to link the Tika function closely to your code, then 
use the server option wherever possible.


Best regards

Ralph


-- 

*Imixs Software Solutions GmbH*
*Web:* www.imixs.com <http://www.imixs.com> *Phone:* +49 (0)89-452136 16
*Office:* Agnes-Pockels-Bogen 1, 80992 München
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsführer: Gaby Heinle u. Ralph Soika

*Imixs* is an open source company, read more: www.imixs.org 
<http://www.imixs.org>