Posted to dev@tika.apache.org by Bob Paulin <bo...@bobpaulin.com> on 2020/08/27 02:24:18 UTC

OSGi support in Tika 2.0

Hi,

I wanted to discuss OSGi support in Tika 2.0.  My current thought is to
start with the minimum support which is to add bundle packaging to each
of the modules [1].  This will make the bundles usable in OSGi but will
leave users on their own for putting the right dependencies together for
usage.  From there we either stop or we can choose from a few different
options:

1) Tika Bundle

 This is an all-encompassing uber jar with all the parsers and
dependencies we can legally get away with shipping with an Apache license.

Pros

Low bar to entry for novice OSGi users

Already exists in Tika 1.x

Cons

Difficult to maintain (very complicated maven-bundle-plugin config).
This has broken in several releases, leaving it unusable.


2) Tika module convenience bundles

This was part of the early 2.0 POC branch where each module had its own
tika-bundle with just its dependencies statically included.

Pros

Less sophisticated maven-bundle-plugin configuration

Low bar for novice OSGi users

Cons

More sub-modules to maintain.


There are of course other options but I think it's important to decide
if either, neither, or both of these options should be considered for
the initial 2.0 release.
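
As a rough illustration of the minimal option (bundle packaging only): a jar becomes an OSGi bundle when its manifest carries headers such as Bundle-SymbolicName, and its Import-Package header names the packages the user still has to supply from other bundles. The sketch below reads those two headers with the JDK's java.util.jar.Manifest. The class name, method names, and the sample manifest are hypothetical and are not taken from Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Manifest;

// Hypothetical helper, not part of Tika: inspects a jar manifest for the
// OSGi headers that bundle packaging would add.
class BundleManifestCheck {

    // A jar is treated as an OSGi bundle if it declares Bundle-SymbolicName.
    static boolean isOsgiBundle(Manifest manifest) {
        String name = manifest.getMainAttributes().getValue("Bundle-SymbolicName");
        return name != null && !name.trim().isEmpty();
    }

    // Returns the raw Import-Package header, or "" if absent. These are the
    // packages the OSGi container must resolve from other bundles.
    static String importedPackages(Manifest manifest) {
        String value = manifest.getMainAttributes().getValue("Import-Package");
        return value == null ? "" : value;
    }

    // Parses manifest text; each line must end with a newline.
    static Manifest parse(String text) {
        try {
            return new Manifest(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up manifest for illustration; real Tika manifests will differ.
        Manifest m = parse(
            "Manifest-Version: 1.0\n"
            + "Bundle-ManifestVersion: 2\n"
            + "Bundle-SymbolicName: org.example.parser\n"
            + "Import-Package: org.example.dep\n");
        System.out.println("bundle? " + isOsgiBundle(m));
        System.out.println("imports: " + importedPackages(m));
    }
}
```

Anything listed under Import-Package and not embedded in the bundle is what leaves users "on their own" in this option.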


- Bob


[1]  https://github.com/apache/tika/pull/344



Re: OSGi support in Tika 2.0

Posted by Tim Allison <ta...@apache.org>.
Bob,

Thank you for taking the lead on this discussion!

tl;dr -- I somewhat prefer tighter modularization at the risk of duplicate
dependencies, too.  The simplicity of higher level bundles might make sense
if we do a slight refactoring of the tika-parsers module.


After a week away, I'm thinking it might make sense to refactor the
tika-parsers module a bit to make explicit some of my underlying design
choices.  This basic design is somewhat already in main, but it is hidden.

Based on user feedback over the years, it feels like there are three
categories of parsers.

1) tika-parsers-module
  * pure Java ... no native libs
  * no parsers that are network dependent/rely on rest clients
  * "heavy" dependencies should be justified by the utility for the
"general user" -- this is admittedly and regrettably qualitative/hand-wavy,
but I want to allow POI's ooxml-schemas but disallow other large
dependencies for more niche formats
  * no ML, entity extraction or "recognizers"
  * packaging: separate modules as we now have, but packaged and shipped as
a single jar, and used as default by tika-app and tika-server

  EXCEPTION: OCR. Justification: this has become such a basic expectation
for many users and is tightly coupled in the PDFParser.  Further, the
current "dependency" requires a user to install tesseract, meaning that the
user has to choose to add this dependency.

2) tika-parsers-extended-module
  * native libs are allowed
  * parsers that are network dependent/rely on rest clients are allowed
  * no ML, entity extraction or recognizers
  * may have dependencies on parsers in tika-parsers-module
  * packaging: separate modules as we now have, packaged and released per
module (e.g. there will be a sqlite-parser jar which includes our parser
_and_ xerial.org's native dependency); users will need to add their
chosen parsers to their classpath if they're using tika-app or tika-server

3) tika-parsers-advanced-module
  * enormous dependencies, native libs and rest clients are allowed
  * ML, entity extraction, recognizers are allowed
  * may have dependencies on parsers in tika-parsers-module
  * packaging: separate jars per sub module.  These jars will not be part
of the release.
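
One way to read the three categories above is as a decision rule over a parser's properties. The sketch below is only that reading, not Tika code: the tier names, the method, and its boolean parameters are hypothetical. Whether a dependency counts as "enormous" is the qualitative judgment described above, so it is passed in as a flag rather than computed. The OCR exception is not modeled; it would be a manual override that keeps OCR in the first tier.

```java
// Hypothetical sketch of the proposed categories as a decision rule.
// CORE     ~ tika-parsers-module
// EXTENDED ~ tika-parsers-extended-module
// ADVANCED ~ tika-parsers-advanced-module
class ParserTiers {

    enum Tier { CORE, EXTENDED, ADVANCED }

    static Tier classify(boolean usesNativeLibs,
                         boolean usesNetwork,
                         boolean usesMlOrRecognizers,
                         boolean hasEnormousDeps) {
        // ML, entity extraction, recognizers, or enormous dependencies
        // are only allowed in the advanced tier.
        if (usesMlOrRecognizers || hasEnormousDeps) {
            return Tier.ADVANCED;
        }
        // Native libs or network/REST clients push a parser out of core.
        if (usesNativeLibs || usesNetwork) {
            return Tier.EXTENDED;
        }
        // Pure Java, no network, no ML: eligible for the default jar.
        return Tier.CORE;
    }

    public static void main(String[] args) {
        // The sqlite parser ships a native dependency, so it is not core.
        System.out.println(classify(true, false, false, false));
    }
}
```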

I'll work on this in a separate branch today so that we can look at it
together.  I think it is important to get this consolidated before we make
OSGi decisions.

Note: I do not mean to hijack the OSGi discussion!  And, I'm sorry for not
realizing this earlier/including it in the refactoring a week ago, but here
we are. :D

Thank you, Bob and all, again!

Cheers,

          Tim




On Fri, Aug 28, 2020 at 10:46 AM Yegor Kozlov <ye...@dinom.ru> wrote:

> Hi Bob,
>
> I'd say decomposition into smaller bundles is the way to go. In my
> experience, OSGi bundles with too many dependencies are fragile and hard to
> maintain. In the worst case, a regression in a maven-bundle-plugin
> configuration would break a parser bundle instead of breaking all of them
> in the uber-jar.
>
> Static linking of dependencies should be fine; however, it can increase
> the total size of the Tika distro because different parser bundles may
> embed the same transitive dependencies like Apache Commons, etc.  The huge
> pro is that static linking will make the bundles self-contained.
> The alternative is to make dependencies optional, but in this case clients
> will have to solve the puzzle of adding them into their OSGi containers.
> It's doable, but will kill acceptance.
>
>
>  Regards,
>  Yegor

Re: OSGi support in Tika 2.0

Posted by Yegor Kozlov <ye...@dinom.ru>.
Hi Bob,

I'd say decomposition into smaller bundles is the way to go. In my
experience, OSGi bundles with too many dependencies are fragile and hard to
maintain. In the worst case, a regression in a maven-bundle-plugin
configuration would break a parser bundle instead of breaking all of them
in the uber-jar.

Static linking of dependencies should be fine; however, it can increase
the total size of the Tika distro because different parser bundles may
embed the same transitive dependencies like Apache Commons, etc.  The huge
pro is that static linking will make the bundles self-contained.
The alternative is to make dependencies optional, but in this case clients
will have to solve the puzzle of adding them into their OSGi containers.
It's doable, but will kill acceptance.
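
To make the size trade-off concrete, here is a small sketch that, given which dependencies each bundle embeds, reports the ones embedded by more than one bundle. The class, the method, and any bundle or dependency names passed to it are hypothetical; no real Tika dependency data is assumed.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: finds dependencies that several self-contained
// bundles would each embed, which is where the distro size grows.
class EmbeddedDependencyOverlap {

    // Maps each dependency to the number of bundles embedding it,
    // keeping only dependencies embedded by two or more bundles.
    static Map<String, Integer> duplicated(Map<String, Set<String>> embeddedByBundle) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> deps : embeddedByBundle.values()) {
            for (String dep : deps) {
                counts.merge(dep, 1, Integer::sum);
            }
        }
        // Drop dependencies that appear in only one bundle.
        counts.values().removeIf(n -> n < 2);
        return counts;
    }
}
```

Every entry in the result is a jar that would be shipped more than once under static linking, and would instead have to be installed separately by the user under the optional-dependency alternative.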


 Regards,
 Yegor
