Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2024/04/08 10:30:11 UTC

Replace baseline language detection in tika-server and tika-app in 3.x?

All,
  As Brian pointed out, optimaize is no longer maintained, and it has
some dependencies that have aged out. Should we replace our baseline
langdetect in tika-app and tika-server in 3.x?
  I'd say that we should go with our OpenNLP-based language detection,
but that, too, is effectively EOL'd because OpenNLP >= 2.3.0 requires
Java 17.
  Thoughts?

            Best,

                Tim

---------- Forwarded message ---------
From: Brian Laskey <bl...@us.ibm.com>
Date: Fri, Mar 8, 2024 at 2:38 PM
Subject: RE: Replacing full tika-app.jar to directly using tiki-core /
and parsers
To: user@tika.apache.org <us...@tika.apache.org>


Hi Tim



Thanks, this is helpful.



For tika-app, we found that the dependency on
org.apache.tika:tika-langdetect-optimaize brings in some older
third-party jars. Unfortunately, it appears that
com.optimaize.languagedetector:language-detector 0.6 is unmaintained,
so its dependency on a vulnerable version of guava (18.0) causes us
problems with security scans. I could be wrong, but I don't believe we
need this component for our usage of just detect and parse?



We have a sort of microservice process (Java-based) that ingests the
files parsed by Tika. It was nice that we could isolate Tika in its
own heap space as a separate Java process rather than adding it to
our app, but I suppose we could work around that.



Thank you

Brian Laskey



From: Tim Allison <ta...@apache.org>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Friday, March 8, 2024 at 9:44 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using
tiki-core / and parsers




Hi Brian,

  A few thoughts:



1) tika-app is basically tika-core + tika-parsers-standard-package.
Which components are you trying to avoid? tika-serialization and
jackson? boilerpipecontenthandler and some of its dependencies? I ask,
because we could factor out a tika-app-core with no parsers in Tika
3.x, which is what we do now with tika-server-core and
tika-server-standard.



2) Unrelated: there are probably more efficient ways of running Tika
than calling it once per file from the command line. That is at least
a robust option!



If all you want is detection and text extraction, and you want to run
it from the command line, write two classes whose main() methods call:

System.out.println(new Tika().detect(new File(args[0])));



or



System.out.println(new Tika().parseToString(new File(args[0])));



On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey <bl...@us.ibm.com> wrote:

Hello Tika community,



Our team is migrating away from usage of tika-app.jar (2.6 currently)
to something with more minimal third party dependencies which we can
control.



Is there any good documentation or pathway that describes how a team
could map the tika-app functionality we use to the same behavior using
just tika-core and tika-parsers-standard-package (I assume)?



The tika-app functions we use today are:



Mime-type detection

java -jar tika-app.jar -d <file>



and

Text extraction attempts

java -jar tika-app.jar -t <file>



Is there a subset of tika parser jars we would need to include to have
equivalent functionality if we wrote our own wrapper main class?
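
For readers following the thread, the two invocations listed above (-d
and -t) could be driven from a Java caller roughly as follows. This is
a minimal sketch and not part of the original message: the class name
TikaAppCommand, the jar location, the 512m heap setting, and the
assumption that the output is UTF-8 are all hypothetical. Only the -d
and -t options come from the message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Tika: runs tika-app once per file in a child JVM.
final class TikaAppCommand {
    private TikaAppCommand() {}

    // flag must be "-d" (mime-type detection) or "-t" (text extraction),
    // the two options named in the message above.
    static List<String> build(String jarPath, String maxHeap, String flag, String file) {
        if (!"-d".equals(flag) && !"-t".equals(flag)) {
            throw new IllegalArgumentException("unsupported flag: " + flag);
        }
        // -Xmx gives the child JVM its own heap limit, separate from the caller's.
        return List.of("java", "-Xmx" + maxHeap, "-jar", jarPath, flag, file);
    }

    // Starts the child process and returns its stdout; stderr goes to the caller's stderr.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        // Assumption: the child writes UTF-8 to stdout.
        String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("tika-app exited with code " + exit);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Assumed layout: tika-app.jar in the working directory, file path as first argument.
        System.out.print(run(build("tika-app.jar", "512m", "-d", args[0])));
    }
}
```

Keeping the command construction separate from the process launch makes
the argument list easy to check without starting a JVM. Spawning one
process per file pays JVM startup cost each time, which is the
trade-off for the heap isolation.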



Thank you,

Brian Laskey

Java 17 for 3.x?

Posted by Tim Allison <ta...@apache.org>.
Updating the title for this thread...

It has been 7 months since we had consensus to move to Java 11 for 3.x
[0]. Should we reopen the discussion of moving to Java 17 for 3.x as
proposed by Eric, or should we stick with the Java 11 plan for now?

[0] https://lists.apache.org/thread/c330b12h1fvmq8x1099mgw3tfs0gcp6q




Re: Replace baseline language detection in tika-server and tika-app in 3.x?

Posted by Tim Allison <ta...@apache.org>.
From October 2023:
https://www.brilworks.com/blog/java-11-countdown-to-end-of-support/

Getting 3.x out has taken longer than I had anticipated. Should we
reopen the 17 vs. 11 discussion given Eric's input? Or do we continue
with the plan to target 11 in 3.x for the foreseeable future?


Re: Replace baseline language detection in tika-server and tika-app in 3.x?

Posted by Eric Pugh <ep...@opensourceconnections.com>.
Time to move on? Lucene 10 will be on 17+, Solr 10 will be on 17+, OpenNLP is already there... Java 11 is EOL and has been for a while...

Are there any other file parsers that we know of that are being optimized to take advantage of features in recent Java versions?


_______________________
Eric Pugh | Founder | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.


Re: Replace baseline language detection in tika-server and tika-app in 3.x?

Posted by Tim Allison <ta...@apache.org>.
Sorry, more correctly:

OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0
requires Java 17 and our 3.x is still on 11.
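
One way to picture the constraint: code built for Java 11 still runs
on a newer JVM, so a component that needs Java 17 could in principle
be gated on the running Java version. The sketch below is only an
illustration under that assumption. The class and method names are
hypothetical, and this is not how Tika selects a language detector.

```java
// Hypothetical guard, not taken from Tika or OpenNLP.
final class JavaLevelCheck {
    // From the message above: OpenNLP >= 2.3.0 requires Java 17.
    static final int OPENNLP_2_3_MIN_JAVA = 17;

    private JavaLevelCheck() {}

    // True when the given Java feature version satisfies the requirement.
    static boolean meetsOpenNlpRequirement(int javaFeatureVersion) {
        return javaFeatureVersion >= OPENNLP_2_3_MIN_JAVA;
    }

    public static void main(String[] args) {
        // Runtime.version().feature() is available since Java 10.
        int feature = Runtime.version().feature();
        System.out.println("Java " + feature + " meets the OpenNLP >= 2.3.0 requirement: "
                + meetsOpenNlpRequirement(feature));
    }
}
```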
