Posted to dev@tika.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2014/06/17 22:16:14 UTC

Can some of tika-parsers module dependencies be made optional ?

Hi

A CXF colleague of mine has started experimenting with a light-weight 
search utility for the CXF Search API. It would use a Lucene handler 
(shipped with CXF for a while) that translates FIQL or OData queries 
into a composite Lucene Query and runs it against Tika-provided 
metadata and content. The idea is not new; I believe Solr does some 
very advanced Tika-based search. CXF users would use it as part of 
their regular JAX-RS applications.

The problem seems to be that Tika Parsers module contains many 
dependencies that may not be needed by a specific custom JAX-RS application.

For example, we'd expect a given application to deal with PDF only, or 
with a certain set of image formats only, or with Word docs only, etc.

I'm not sure how many Tika-parsers dependencies are strongly required 
for any Tika application and which can be made optional.

If Tika Parsers does have some dependencies that could be optional, 
would it make sense to mark them as such, so that external Tika 
consumers do not have to download all the deps? It would make a 
difference IMHO.

Thanks, Sergey



Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
Though we can also exclude some of the dependencies directly in our 
pom.xml; that can work too.
We'll experiment a bit to see which dependencies are absolutely needed 
for tika-parsers and which may be excluded.
Any feedback will be appreciated.

Thanks, Sergey



Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi
On 18/06/14 13:52, Ray Gauss wrote:
> I think for 2.0 we should consider splitting out parsers into their own projects for a streamlined dependency hierarchy then reassembling them with something like a tika-parsers-all artifact.
>
Something like that would make sense IMHO.
>
> On June 17, 2014 at 5:08:38 PM, Nick Burch (apache@gagravarr.org) wrote:
>> On Tue, 17 Jun 2014, Sergey Beryozkin wrote:
>>> The problem seems to be that Tika Parsers module contains many dependencies
>>> that may not be needed by a specific custom JAX-RS application.
>>>
>>> For example, we'd expect a given application dealing with PDF only, or a
>>> certain set of image formats only, or word docs only, etc.
>>>
>>> I'm not sure how many Tika-parsers dependencies are strongly required for any
>>> Tika application and which can be made optional.
>>
>> Just zap the Tika Parser dependency jars you don't want. All of the Tika
>> Parsers should by default silently fail if their dependencies are missing,
>> so after that going to /parsers/ you just won't see them there, and if you
>> try to parse that kind of document you'll get EmptyParser's result
>> instead.
Nick, sorry, missed your hint re zapping the unneeded dependencies :-), 
so I duplicated what you suggested in my earlier follow-up to this thread

Thanks, Sergey
>>
>> Nick
>>



Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 18 Jun 2014, Ken Krugler wrote:
> On Jun 18, 2014, at 9:08am, Nick Burch <ap...@gagravarr.org> wrote:
>> On Wed, 18 Jun 2014, Sergey Beryozkin wrote:
>>> Can we start with adding a section to Tika docs documenting the core 
>>> dependencies of the tika-parsers module to make the life a bit easier 
>>> for developers who do not expect the specific parser implementations 
>>> immediately downloaded ?
>>
>> Are you not just better off asking Maven nicely, and have it tell you 
>> that info itself? Much more likely to be accurate and up-to-date than 
>> something we cut and paste from Maven's output from time to time…
>
> I'm curious - assuming I only want to parse HTML and PDF (as an 
> example), then what's the right way to ask Maven nicely for what I need 
> to include?

That's a different question though. Sergey wanted the docs to list the 
core dependencies of Tika Parsers, which Maven can tell you. (Direct 
dependencies are listed in the pom; direct + indirect ones come from 
"mvn dependency:list".)

If you just want one Tika parser, the simplest way is to:
  * Use tika-app --list-parser-details to find out which class handles
    the mimetype you want
  * Grep the tika parsers source tree for that class's package, and get
    the list of imports it makes
  * Change your pom which includes tika parsers to have an exclusion for *
    on the tika parsers dependency
  * Explicitly list the artifacts that provide the imports you saw

Yes, it is largely manual, but at the point where you want to exclude a 
bunch of tika parsers your use case is IMHO special enough that you're 
already doing enough work that the above isn't much extra.

(For most people, having everything there as standard is what you want to 
start with)

Nick

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Nick,

On Jun 18, 2014, at 9:08am, Nick Burch <ap...@gagravarr.org> wrote:

> On Wed, 18 Jun 2014, Sergey Beryozkin wrote:
>> Can we start with adding a section to Tika docs documenting the core dependencies of the tika-parsers module to make the life a bit easier for developers who do not expect the specific parser implementations immediately downloaded ?
> 
> Are you not just better off asking Maven nicely, and have it tell you that info itself? Much more likely to be accurate and up-to-date than something we cut and paste from Maven's output from time to time…

I'm curious - assuming I only want to parse HTML and PDF (as an example), then what's the right way to ask Maven nicely for what I need to include?

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 18 Jun 2014, Sergey Beryozkin wrote:
> Can we start with adding a section to Tika docs documenting the core 
> dependencies of the tika-parsers module to make the life a bit easier for 
> developers who do not expect the specific parser implementations immediately 
> downloaded ?

Are you not just better off asking Maven nicely, and have it tell you that 
info itself? Much more likely to be accurate and up-to-date than something 
we cut and paste from Maven's output from time to time...

Nick

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 18 Jun 2014, Ken Krugler wrote:
> I'm not much of a Maven maven, so what's the right way to manually pull 
> some subset of parsers & dependencies?

If you didn't want POI, you'd do something like:

   <dependencies>
     <dependency>
       <groupId>${project.groupId}</groupId>
       <artifactId>tika-parsers</artifactId>
       <version>${project.version}</version>
       <exclusions>
         <exclusion>
           <groupId>org.apache.poi</groupId>
           <artifactId>*</artifactId>
         </exclusion>
       </exclusions>
     </dependency>
   </dependencies>

If you wanted to exclude everything except pdfbox, I believe you can do 
an exclusion with groupId and artifactId both set to *, then explicitly 
list pdfbox in the dependencies section of your own pom
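
A minimal sketch of that wildcard approach, offered as an illustration rather than a tested recipe. It assumes a Maven release recent enough to accept * in exclusions, and the ${tika.version} and ${pdfbox.version} properties are hypothetical placeholders you would define yourself. Note that the wildcard also removes tika-core from the transitive set, so it is listed explicitly here:

```xml
<dependencies>
  <!-- tika-core is listed explicitly because the wildcard exclusion
       below also strips it from the transitive dependencies -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>${tika.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>${tika.version}</version>
    <exclusions>
      <!-- drop every transitive dependency of tika-parsers -->
      <exclusion>
        <groupId>*</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <!-- add back only the library behind the parser you need -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${pdfbox.version}</version>
  </dependency>
</dependencies>
```

The PDF parser may need further libraries for some documents, so it is worth comparing the result against the output of "mvn dependency:list" on an unmodified tika-parsers dependency.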

Nick

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Nick,

On Jun 18, 2014, at 9:07am, Nick Burch <ap...@gagravarr.org> wrote:

> On Wed, 18 Jun 2014, Sergey Beryozkin wrote:
>> The reason we need it is that CXF can not ship all of Tika Parser dependencies because CXF will only offer a light-weight Tika-aware handler.
> 
> Sounds like you just want to depend on tika-core then, and not tika-parsers. That'll give you mime magic detection, and all the parser framework, but no parsers, and none of the parser dependencies. (You could manually pull in one or two parsers + their dependencies if you wanted to)


I'm not much of a Maven maven, so what's the right way to manually pull some subset of parsers & dependencies?

Thanks,

-- Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Nick
On 18/06/14 17:07, Nick Burch wrote:
> On Wed, 18 Jun 2014, Sergey Beryozkin wrote:
>> The reason we need it is that CXF can not ship all of Tika Parser
>> dependencies because CXF will only offer a light-weight Tika-aware
>> handler.
>
> Sounds like you just want to depend on tika-core then, and not
> tika-parsers. That'll give you mime magic detection, and all the parser
> framework, but no parsers, and none of the parser dependencies. (You
> could manually pull in one or two parsers + their dependencies if you
> wanted to)

Yes, depending on tika-core only made our main source code compile, and 
adding tika-parsers with test scope made the tests using PDFParser 
pass. Thanks for the hint, I did not know tika-core was enough.
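
For reference, a sketch of what such a setup could look like in the pom of the consuming module. The ${tika.version} property is a hypothetical placeholder, and the fragment is an illustration rather than the actual CXF pom:

```xml
<dependencies>
  <!-- compile time: parser framework and type detection only,
       no concrete parsers and none of their libraries -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>${tika.version}</version>
  </dependency>
  <!-- test time only: brings in the concrete parsers (e.g. PDFParser) -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>${tika.version}</version>
    <scope>test</scope>
  </dependency>
</dependencies>
```

Since test-scoped dependencies are not passed on transitively, projects depending on this module would only inherit tika-core.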

So the issue of dependency management is then passed on to the future 
users of our API.
The use case we target is something like this: a CXF user has a custom 
application that accepts documents in a limited set of formats (say 
PDF and Word, or Excel only, or a photo-shop kind of application 
managing only a few types of images). We tell this user that CXF can 
help with searching through these documents and that the search can be 
integrated into the application. We tell the user to add the Tika 
parsers dependency, and the user asks us how to get only the PDF and 
Excel deps.

I don't want to recommend that they go through the exclusion process and 
possibly check the source tree, as you suggested in the other email :-)

Is tika-parsers effectively a collection of various parser dependencies, 
with no common dependencies that all the parser implementations need, 
and with tika-core providing the support? If so, why don't we document 
which well-known modules support which file formats? This would let 
users not worry about tika-parsers at all and select the dependencies 
they need by checking the docs.

Sergey

>
> Nick



Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 18 Jun 2014, Sergey Beryozkin wrote:
> The reason we need it is that CXF can not ship all of Tika Parser 
> dependencies because CXF will only offer a light-weight Tika-aware handler.

Sounds like you just want to depend on tika-core then, and not 
tika-parsers. That'll give you mime magic detection, and all the parser 
framework, but no parsers, and none of the parser dependencies. (You could 
manually pull in one or two parsers + their dependencies if you wanted to)

Nick

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
On 18/06/14 16:58, Sergey Beryozkin wrote:
> Hi Nick
> On 18/06/14 16:38, Nick Burch wrote:
>> On Wed, 18 Jun 2014, Ray Gauss wrote:
>>> I think for 2.0 we should consider splitting out parsers into their
>>> own projects for a streamlined dependency hierarchy then reassembling
>>> them with something like a tika-parsers-all artifact.
>>
>> We had another thread on that not that long ago, where someone cautioned
>> against breaking it up into too many pieces. We also have fairly
>> frequent posts on the users list from people who aren't getting any
>> content returned, because they've forgotten to include a dependency on
>> tika-parsers
>>
>> I'm not convinced that splitting tika parsers into 20 odd dependencies
>> is really going to help more than it hinders - more people will get
>> confused by missing dependencies they really wanted, and anyone with
>> special needs about what does/doesn't get parsed is probably going to be
>> taking such care that they can just exclude everything by default anyway
>> and just pull in what they need. I'd probably rather we just gave an
>> example pom snippet that shows how to exclude all except one thing, and
>> let people with special cases work from there.
>>
> Can we start with adding a section to Tika docs documenting the core
> dependencies of the tika-parsers module to make the life a bit easier
> for developers who do not expect the specific parser implementations
> immediately downloaded ?
> And listing the parser implementation dependencies too ? So we'd exclude
> them all from our CXF module depending on Tika and point the users to
> the section listing the well known Tika parser implementations for them
> to choose what they need ?
The reason we need it is that CXF can not ship all of Tika Parser 
dependencies because CXF will only offer a light-weight Tika-aware handler.

By the way I'll be happy to help with the documentation if you let me 
know the details here

Cheers, Sergey

>
> Thanks, Sergey
>
>> Nick
>


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Blog: http://sberyozkin.blogspot.com

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Nick
On 18/06/14 16:38, Nick Burch wrote:
> On Wed, 18 Jun 2014, Ray Gauss wrote:
>> I think for 2.0 we should consider splitting out parsers into their
>> own projects for a streamlined dependency hierarchy then reassembling
>> them with something like a tika-parsers-all artifact.
>
> We had another thread on that not that long ago, where someone cautioned
> against breaking it up into too many pieces. We also have fairly
> frequent posts on the users list from people who aren't getting any
> content returned, because they've forgotten to include a dependency on
> tika-parsers
>
> I'm not convinced that splitting tika parsers into 20 odd dependencies
> is really going to help more than it hinders - more people will get
> confused by missing dependencies they really wanted, and anyone with
> special needs about what does/doesn't get parsed is probably going to be
> taking such care that they can just exclude everything by default anyway
> and just pull in what they need. I'd probably rather we just gave an
> example pom snippet that shows how to exclude all except one thing, and
> let people with special cases work from there.
>
Can we start by adding a section to the Tika docs documenting the core 
dependencies of the tika-parsers module, to make life a bit easier for 
developers who do not expect the specific parser implementations to be 
downloaded immediately?
And list the parser implementation dependencies too? Then we'd exclude 
them all from our CXF module that depends on Tika and point users to 
the section listing the well-known Tika parser implementations so they 
can choose what they need.

Thanks, Sergey

> Nick


Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Nick Burch <ap...@gagravarr.org>.
On Sat, 21 Jun 2014, Ray Gauss wrote:
> I’d have to respectfully disagree with most of those points but if 
> there’s that much resistance to the idea I’ll drop it.

Please make your case!

Nick

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Ray Gauss <ra...@alfresco.com>.
I’d have to respectfully disagree with most of those points but if there’s that much resistance to the idea I’ll drop it.

Cheers,

Ray


On June 19, 2014 at 3:22:14 PM, Nick Burch (apache@gagravarr.org) wrote:
> On Thu, 19 Jun 2014, Ray Gauss wrote:
> > The point of a tika-parsers-all artifact would be a single dependency
> > that re-aggregates everything so that downstream projects could work the
> > same way they do now and not worry about missing dependencies.
> >
> > What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
>  
> We already have users confused by the current split between tika-core and
> tika-parsers - see users list for example. We already have users confused
> by what dependencies they need with the current poms setup. Splitting is
> going to make that a lot worse. (POI, as a related example, sees plenty of
> confused users who've got mis-matched jars and problems. Splitting is
> going to make that a lot worse.)
>  
> We have previously tried pushing parsers out of the tika parser jar and
> into other jars, eg ones maintained by external groups, but on the whole
> it hasn't been a great success. Keeping them in sync, dealing with
> different cycles, applying updates, keeping them consistent, building in a
> sensible length of time, all of that would be harder with a pile of
> modules.
>  
> If we were to split out to the level needed by some of the use cases
> mentioned, we'd have so many parser modules it'd be a nightmare to
> maintain, and would cause the problems mentioned above. (People in other
> threads have cautioned on these problems). If we split into just a handful
> of sub modules, then many of the use cases mentioned still have to do
> work to pick out the bits they need
>  
> I still believe that the main use case of tika is "everything included",
> and especially that's the beginners use case, so I think we should focus
> on keeping that easy. Peeling out just some bits feels like an advanced
> use case to me, so I'd rather we put the requirement for effort onto those
> folks, rather than onto newbies and people on the typical uses. I'd
> therefore much rather we provide advanced docs/help on excluding some
> bits, rather than pull it out into a pile of different modules.
>  
> Nick

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
Sorry for few typos in the text below, was typing too fast...
On 15/07/14 10:57, Sergey Beryozkin wrote:
> Hi All,
> I've opened 2 JIRA issues, see [1] and [2].
>
> [1] is about documenting the 3rd party transitive tika-parsers
> dependencies to help Maven users exclude the libs not required in a
> given project.
>
> Help on resolving [1] from true Tika experts like Nick and others would
> be appreciated :-).
>
> I can volunteer to fix [2], and not only because that involves much
> less work :-).
>
> In [2] (which strongly depends on the resolution of [1]) I proposed
> either making the tika-parsers pom optionally depend on the 3rd party libs
> (in which case I can promise Nick I will answer every user query related
> to the new tika-parsers module not strongly depending on all of the 3rd
> party libs :-)) or keeping tika-parsers intact and introducing a
> tika-parsers-optional pom.
>
> There's also a 3rd solution mentioned earlier involving a complete
> modularization of tika-parsers - that would be a more involved and
> possibly more sensitive solution so I'm not adding it to the list in [2]
> for now to make it easier for us to come to some resolution...
>
> Thanks, Sergey
>
> [1] https://issues.apache.org/jira/browse/TIKA-1367
> [2] https://issues.apache.org/jira/browse/TIKA-1368
>
> On 14/07/14 22:19, Sergey Beryozkin wrote:
>> Hi Nick, All,
>>
>> I've revisited this subject recently. I have to admit it is not ideal.
>> I see new parsers are added every two weeks or so, and having downstream
>> tika-parsers consumers keep excluding all the unneeded dependencies
>> (which can change dynamically - well, it's not that dynamic :-) but you
>> see what I mean) can become a problem.
>>
>> How about this approach:
>>
>> Introduce tika-parsers-optional module (pom.xml only) which will be
>> exactly the same as tika-parsers except that tika-parsers-optional will
>> depend on tika-parsers but have all the specific parser libs
>> dependencies set as optional. Effectively this pom.xml will only have
>> a single dependency with
>>
>> <dependency>
>>    <artifactId>tika-parsers</artifactId>
>>    <exclusions>
>>      <!-- exclude specific parser libs -->
>>    </exclusions>
>> </dependency>
>>
>> Users who do not want to spend time excluding each and every
>> parser lib dep they do not need will use tika-parsers-optional, look
>> at the Tika documentation and add only those specific deps that they
>> need.
>>
>> To be honest this seems to be a rather messy approach; having
>> tika-parsers use optional parser lib dependencies and getting users to
>> add those libs they actually need (again after looking at the
>> documentation) is better. This is not that destabilizing to be honest -
>> any practical application is expected to be aware of the actual file
>> formats and parser libs supporting those formats.
>>
>> But I'd like to propose tika-parsers-optional as an alternative; its
>> advantage is that it can leave all existing tika-parsers users in peace...
>>
>> Thoughts ?
>>
>> Thanks, Sergey
>>
>>
>>
>> On 19/06/14 20:22, Nick Burch wrote:
>>> On Thu, 19 Jun 2014, Ray Gauss wrote:
>>>> The point of a tika-parsers-all artifact would be a single dependency
>>>> that re-aggregates everything so that downstream projects could work
>>>> the same way they do now and not worry about missing dependencies.
>>>>
>>>> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
>>>
>>> We already have users confused by the current split between tika-core
>>> and tika-parsers - see users list for example. We already have users
>>> confused by what dependencies they need with the current poms setup.
>>> Splitting is going to make that a lot worse. (POI, as a related example,
>>> sees plenty of confused users who've got mis-matched jars and problems.
>>> Splitting is going to make that a lot worse.)
>>>
>>> We have previously tried pushing parsers out of the tika parser jar and
>>> into other jars, eg ones maintained by external groups, but on the whole
>>> it hasn't been a great success. Keeping them in sync, dealing with
>>> different cycles, applying updates, keeping them consistent, building in
>>> a sensible length of time, all of that would be harder with a pile of
>>> modules.
>>>
>>> If we were to split out to the level needed by some of the use cases
>>> mentioned, we'd have so many parser modules it'd be a nightmare to
>>> maintain, and would cause the problems mentioned above. (People in other
>>> threads have cautioned on these problems). If we split into just a
>>> handful of sub modules, then many of the use cases mentioned still have
>>> to do work to pick out the bits they need
>>>
>>> I still believe that the main use case of tika is "everything included",
>>> and especially that's the beginners use case, so I think we should focus
>>> on keeping that easy. Peeling out just some bits feels like an advanced
>>> use case to me, so I'd rather we put the requirement for effort onto
>>> those folks, rather than onto newbies and people on the typical uses.
>>> I'd therefore much rather we provide advanced docs/help on excluding
>>> some bits, rather than pull it out into a pile of different modules.
>>>
>>> Nick
>>
>>
>


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Blog: http://sberyozkin.blogspot.com

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi,
On 15/07/14 12:34, Ray Gauss wrote:
> I’m not sure the third option is much more work up front than pulling apart the transitive dependencies for documentation purposes, though it is more sensitive as you say.
>
As far as I understand the 3rd option would require introducing many 
micro modules. I guess the bigger tika-parsers becomes the more 
important the 3rd option becomes. I'd just like to come up with some 
intermediary decision to get things moving a bit. 3rd option can be 
reviewed for 2.0 as you suggested, etc...

> Just to confirm, with any of the other solutions we would need to manually document not just immediate dependencies but all transitive dependencies for each new parser added going forward rather than letting Maven automagically manage things, correct?
>
I'm thinking of documenting only the top-level transitive dependencies. 
For example, if we want to work with PDFParser then we'd see that the 
pdfbox lib is used for it (documenting pdfbox's own dependencies is out 
of scope). If tika-parsers has some dependencies which are required by 
most of the Parser implementations then they'd stay there as is...
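
To make the optional-dependency variant from TIKA-1368 a bit more concrete, here is a rough sketch. Both fragments are hypothetical: the first shows how a parser lib could be declared in the tika-parsers pom, the second what a consumer who needs PDF support would then add. The ${pdfbox.version} property is a placeholder, and nothing here reflects the actual Tika pom:

```xml
<!-- hypothetical fragment of the tika-parsers pom: the parser lib is
     marked optional, so it is not passed on transitively to projects
     that depend on tika-parsers -->
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>${pdfbox.version}</version>
  <optional>true</optional>
</dependency>

<!-- hypothetical fragment of a consumer pom: the user picks the lib
     for the one format the application handles, as found in the docs -->
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>${pdfbox.version}</version>
</dependency>
```

With this layout the documented list of top-level libs is what tells the consumer which dependency to add for which parser.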

Cheers, Sergey

> Regards,
>
> Ray
>
>
> On July 15, 2014 at 5:58:11 AM, Sergey Beryozkin (sberyozkin@gmail.com) wrote:
>> Hi All,
>> I've opened 2 JIRA issues, see [1] and [2].
>>
>> [1] is about documenting the 3rd party transitive tika-parsers
>> dependencies to help Maven users exclude the libs not required in a
>> given project.
>>
>> Help on resolving [1] from true Tika experts like Nick and others would
>> be appreciated :-).
>>
>> I can volunteer to fix [2], and not only because that involves much
>> less work :-).
>>
>> In [2] (which strongly depends on the resolution of [1]) I proposed
>> either making the tika-parsers pom optionally depend on the 3rd party libs
>> (in which case I can promise Nick I will answer every user query related
>> to the new tika-parsers module not strongly depending on all of the 3rd
>> party libs :-)) or keeping tika-parsers intact and introducing a
>> tika-parsers-optional pom.
>>
>> There's also a 3rd solution mentioned earlier involving a complete
>> modularization of tika-parsers - that would be a more involved and
>> possibly more sensitive solution so I'm not adding it to the list in [2]
>> for now to make it easier for us to come to some resolution...
>>
>> Thanks, Sergey
>>
>> [1] https://issues.apache.org/jira/browse/TIKA-1367
>> [2] https://issues.apache.org/jira/browse/TIKA-1368
>>
>> On 14/07/14 22:19, Sergey Beryozkin wrote:
>>> Hi Nick, All,
>>>
>>> I've revisited this subject recently. I have to admit it is not ideal.
>>> I see new parsers are added every two weeks or so, and having downstream
>>> tika-parsers consumers keep excluding all the unneeded dependencies
>>> (which can change dynamically - well, it's not that dynamic :-) but you
>>> see what I mean) can become a problem.
>>>
>>> How about this approach:
>>>
>>> Introduce tika-parsers-optional module (pom.xml only) which will be
>>> exactly the same as tika-parsers except that tika-parsers-optional will
>>> depend on tika-parsers but have all the specific parser libs
>>> dependencies set as optional. Effectively this pom.xml will only have
>>> a single dependency with
>>>
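>>> <dependency>
>>>    <artifactId>tika-parsers</artifactId>
>>>    <exclusions>
>>>      <!-- exclude specific parser libs -->
>>>    </exclusions>
>>> </dependency>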
>>>
>>> Users who do not want to spend time excluding each and every
>>> parser lib dep they do not need will use tika-parsers-optional, look
>>> at the Tika documentation and add only those specific deps that they need.
>>>
>>> To be honest this seems to be a rather messy approach; having
>>> tika-parsers use optional parser lib dependencies and getting users to
>>> add those libs they actually need (again after looking at the
>>> documentation) is better. This is not that destabilizing to be honest -
>>> any practical application is expected to be aware of the actual file
>>> formats and parser libs supporting those formats.
>>>
>>> But I'd like to propose tika-parsers-optional as an alternative; its
>>> advantage is that it can leave all existing tika-parsers users in peace...
>>>
>>> Thoughts ?
>>>
>>> Thanks, Sergey
>>>
>>>
>>>
>>> On 19/06/14 20:22, Nick Burch wrote:
>>>> On Thu, 19 Jun 2014, Ray Gauss wrote:
>>>>> The point of a tika-parsers-all artifact would be a single dependency
>>>>> that re-aggregates everything so that downstream projects could work
>>>>> the same way they do now and not worry about missing dependencies.
>>>>>
>>>>> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
>>>>
>>>> We already have users confused by the current split between tika-core
>>>> and tika-parsers - see users list for example. We already have users
>>>> confused by what dependencies they need with the current poms setup.
>>>> Splitting is going to make that a lot worse. (POI, as a related example,
>>>> sees plenty of confused users who've got mis-matched jars and problems.
>>>> Splitting is going to make that a lot worse.)
>>>>
>>>> We have previously tried pushing parsers out of the tika parser jar and
>>>> into other jars, eg ones maintained by external groups, but on the whole
>>>> it hasn't been a great success. Keeping them in sync, dealing with
>>>> different cycles, applying updates, keeping them consistent, building in
>>>> a sensible length of time, all of that would be harder with a pile of
>>>> modules.
>>>>
>>>> If we were to split out to the level needed by some of the use cases
>>>> mentioned, we'd have so many parser modules it'd be a nightmare to
>>>> maintain, and would cause the problems mentioned above. (People in other
>>>> threads have cautioned on these problems). If we split into just a
>>>> handful of sub modules, then many of the use cases mentioned still have
>>>> to do work to pick out the bits they need
>>>>
>>>> I still believe that the main use case of tika is "everything included",
>>>> and especially that's the beginners use case, so I think we should focus
>>>> on keeping that easy. Peeling out just some bits feels like an advanced
>>>> use case to me, so I'd rather we put the requirement for effort onto
>>>> those folks, rather than onto newbies and people on the typical uses.
>>>> I'd therefore much rather we provide advanced docs/help on excluding
>>>> some bits, rather than pull it out into a pile of different modules.
>>>>
>>>> Nick
>>>
>>>
>>
>>


Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Ray Gauss <ra...@alfresco.com>.
I’m not sure the third option is much more work up front than pulling apart the transitive dependencies for documentation purposes, though it is more sensitive as you say.

Just to confirm, with any of the other solutions we would need to manually document not just immediate dependencies but all transitive dependencies for each new parser added going forward rather than letting Maven automagically manage things, correct?

Regards,

Ray


On July 15, 2014 at 5:58:11 AM, Sergey Beryozkin (sberyozkin@gmail.com) wrote:
> Hi All,
> I've opened 2 JIRA issues, see [1] and [2].
>  
> [1] is about documenting the 3rd party transitive tika-parsers
> dependencies to help Maven users exclude the libs not required in a
> given project.
>
> Help on resolving [1] from true Tika experts like Nick and others would
> be appreciated :-).
>
> I can volunteer to fix [2], and not only because that involves much
> less work :-).
>
> In [2] (which strongly depends on the resolution of [1]) I proposed
> either making the tika-parsers pom optionally depend on the 3rd party libs
> (in which case I can promise Nick I will answer every user query related
> to the new tika-parsers module not strongly depending on all of the 3rd
> party libs :-)) or keeping tika-parsers intact and introducing a
> tika-parsers-optional pom.
>  
> There's also a 3rd solution mentioned earlier involving a complete
> modularization of tika-parsers - that would be a more involved and
> possibly more sensitive solution so I'm not adding it to the list in [2]
> for now to make it easier for us to come to some resolution...
>  
> Thanks, Sergey
>  
> [1] https://issues.apache.org/jira/browse/TIKA-1367
> [2] https://issues.apache.org/jira/browse/TIKA-1368
>  
> On 14/07/14 22:19, Sergey Beryozkin wrote:
> > Hi Nick, All,
> >
> > I've revisited this subject recently. I have to admit it is not ideal.
> > I see new parsers are added every two weeks or so and having downstream
> > tika-parsers consumers keep excluding all the required dependencies
> > (which can change dynamically - well, it's not that dynamic :-) but you
> > see what I mean) can present a problem.
> >
> > How about this approach:
> >
> > Introduce tika-parsers-optional module (pom.xml only) which will be
> > exactly the same as tika-parsers except that tika-parsers-optional will
> > depend on tika-parsers but have all the specific parser libs
> > dependencies set as optional. Effectively this pom.xml will only have
> > a single dependency with
> >
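> > <dependency>
> >    <artifactId>tika-parsers</artifactId>
> >    <exclusions>
> >      <!-- exclude specific parser libs -->
> >    </exclusions>
> > </dependency>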
> >
> > The users who do not want to spend time on excluding each and every
> > parser lib dep they do not need will use tika-parsers-optional, look
> > at the Tika documentation, and add only those specific deps that they need.
> >
> > To be honest this seems to be a rather messy approach; having
> > tika-parsers use optional parser lib dependencies and getting users
> > to add those libs they actually need (again after looking at the
> > documentation) is better. This is not that destabilizing to be honest -
> > any practical application is expected to be aware of the actual file
> > formats and parser libs supporting those formats.
> >
> > But I'd like to propose tika-parsers-optional as an alternative; its
> > advantage is that it can leave all of the existing tika-parsers users in peace...
> >
> > Thoughts ?
> >
> > Thanks, Sergey
> >
> >
> >
> > On 19/06/14 20:22, Nick Burch wrote:
> >> On Thu, 19 Jun 2014, Ray Gauss wrote:
> >>> The point of a tika-parsers-all artifact would be a single dependency
> >>> that re-aggregates everything so that downstream projects could work
> >>> the same way they do now and not worry about missing dependencies.
> >>>
> >>> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
> >>
> >> We already have users confused by the current split between tika-core
> >> and tika-parsers - see users list for example. We already have users
> >> confused by what dependencies they need with the current poms setup.
> >> Splitting is going to make that a lot worse. (POI, as a related example,
> >> sees plenty of confused users who've got mis-matched jars and problems.
> >> Splitting is going to make that a lot worse.)
> >>
> >> We have previously tried pushing parsers out of the tika parser jar and
> >> into other jars, eg ones maintained by external groups, but on the whole
> >> it hasn't been a great success. Keeping them in sync, dealing with
> >> different cycles, applying updates, keeping them consistent, building in
> >> a sensible length of time, all of that would be harder with a pile of
> >> modules.
> >>
> >> If we were to split out to the level needed by some of the use cases
> >> mentioned, we'd have so many parser modules it'd be a nightmare to
> >> maintain, and would cause the problems mentioned above. (People in other
> >> threads have cautioned on these problems). If we split into just a
> >> handful of sub modules, then many of the use cases mentioned still have
> >> to do work to pick out the bits they need.
> >>
> >> I still believe that the main use case of tika is "everything included",
> >> and especially that's the beginner's use case, so I think we should focus
> >> on keeping that easy. Peeling out just some bits feels like an advanced
> >> use case to me, so I'd rather we put the requirement for effort onto
> >> those folks, rather than onto newbies and people on the typical uses.
> >> I'd therefore much rather we provide advanced docs/help on excluding
> >> some bits, rather than pull it out into a pile of different modules.
> >>
> >> Nick
> >
> >
>  
>  

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi All,
I've opened 2 JIRA issues, see [1] and [2].

[1] is about documenting the 3rd party transitive tika-parsers
dependencies to help Maven users to exclude the libs not required in a
given project.

Help on resolving [1] from true Tika experts like Nick and others would
be appreciated :-).

I can volunteer to fix [2], but not only because that involves much
less work :-).

In [2] (which strongly depends on the resolution of [1]) I proposed
either making tika-parsers pom optionally depend on the 3rd party libs 
(in which case I can promise Nick I will answer every user query related 
to the new tika-parsers module not strongly depending on all of 3rd 
party libs :-)) or keep tika-parsers intact and introduce a 
tika-parsers-optional pom.
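
As a rough sketch of the first option, a parser library entry in the tika-parsers pom.xml could be marked with Maven's standard <optional> element. The pdfbox coordinates are used here only as an example, and ${pdfbox.version} is a placeholder property, not a value taken from the actual pom.

```xml
<!-- Sketch of one dependency entry inside the tika-parsers pom.xml -->
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <!-- placeholder property, assumed for illustration -->
  <version>${pdfbox.version}</version>
  <!-- optional: still used to build tika-parsers, but not passed on
       transitively to projects that depend on tika-parsers -->
  <optional>true</optional>
</dependency>
```

With such a change, a project depending on tika-parsers would no longer receive pdfbox transitively and would have to declare it itself in order to parse PDF files.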

There's also a 3rd solution mentioned earlier involving a complete 
modularization of tika-parsers - that would be a more involved and 
possibly more sensitive solution so I'm not adding it to the list in [2] 
for now to make it easier for us to come to some resolution...

Thanks, Sergey

[1] https://issues.apache.org/jira/browse/TIKA-1367
[2] https://issues.apache.org/jira/browse/TIKA-1368

On 14/07/14 22:19, Sergey Beryozkin wrote:
> Hi Nick, All,
>
> I've revisited this subject recently. I have to admit it is not ideal.
> I see new parsers are added every two weeks or so and having downstream
> tika-parsers consumers keep excluding all the required dependencies
> (which can change dynamically - well, it's not that dynamic :-) but you
> see what I mean) can present a problem.
>
> How about this approach:
>
> Introduce tika-parsers-optional module (pom.xml only) which will be
> exactly the same as tika-parsers except that tika-parsers-optional will
> depend on tika-parsers but have all the specific parser libs
> dependencies set as optional. Effectively this pom.xml will only have
> a single dependency with
>
> <dependency>
>    <artifactId>tika-parsers</artifactId>
>    <exclusions>
>      <!-- exclude specific parser libs -->
>    </exclusions>
> </dependency>
>
> The users who do not want to spend time on excluding each and every
> parser lib dep they do not need will use tika-parsers-optional, look
> at the Tika documentation, and add only those specific deps that they need.
>
> To be honest this seems to be a rather messy approach; having
> tika-parsers use optional parser lib dependencies and getting users
> to add those libs they actually need (again after looking at the
> documentation) is better. This is not that destabilizing to be honest -
> any practical application is expected to be aware of the actual file
> formats and parser libs supporting those formats.
>
> But I'd like to propose tika-parsers-optional as an alternative; its
> advantage is that it can leave all of the existing tika-parsers users in peace...
>
> Thoughts ?
>
> Thanks, Sergey
>
>
>
> On 19/06/14 20:22, Nick Burch wrote:
>> On Thu, 19 Jun 2014, Ray Gauss wrote:
>>> The point of a tika-parsers-all artifact would be a single dependency
>>> that re-aggregates everything so that downstream projects could work
>>> the same way they do now and not worry about missing dependencies.
>>>
>>> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
>>
>> We already have users confused by the current split between tika-core
>> and tika-parsers - see users list for example. We already have users
>> confused by what dependencies they need with the current poms setup.
>> Splitting is going to make that a lot worse. (POI, as a related example,
>> sees plenty of confused users who've got mis-matched jars and problems.
>> Splitting is going to make that a lot worse.)
>>
>> We have previously tried pushing parsers out of the tika parser jar and
>> into other jars, eg ones maintained by external groups, but on the whole
>> it hasn't been a great success. Keeping them in sync, dealing with
>> different cycles, applying updates, keeping them consistent, building in
>> a sensible length of time, all of that would be harder with a pile of
>> modules.
>>
>> If we were to split out to the level needed by some of the use cases
>> mentioned, we'd have so many parser modules it'd be a nightmare to
>> maintain, and would cause the problems mentioned above. (People in other
>> threads have cautioned on these problems). If we split into just a
>> handful of sub modules, then many of the use cases mentioned still have
>> to do work to pick out the bits they need.
>>
>> I still believe that the main use case of tika is "everything included",
>> and especially that's the beginner's use case, so I think we should focus
>> on keeping that easy. Peeling out just some bits feels like an advanced
>> use case to me, so I'd rather we put the requirement for effort onto
>> those folks, rather than onto newbies and people on the typical uses.
>> I'd therefore much rather we provide advanced docs/help on excluding
>> some bits, rather than pull it out into a pile of different modules.
>>
>> Nick
>
>


Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Nick, All,

I've revisited this subject recently. I have to admit it is not ideal.
I see new parsers are added every two weeks or so and having downstream
tika-parsers consumers keep excluding all the required dependencies
(which can change dynamically - well, it's not that dynamic :-) but you
see what I mean) can present a problem.

How about this approach:

Introduce tika-parsers-optional module (pom.xml only) which will be 
exactly the same as tika-parsers except that tika-parsers-optional will 
depend on tika-parsers but have all the specific parser libs 
dependencies set as optional. Effectively this pom.xml will only have
a single dependency with

<dependency>
   <artifactId>tika-parsers</artifactId>
   <exclusions>
     <!-- exclude specific parser libs -->
   </exclusions>
</dependency>

The users who do not want to spend time on excluding each and every
parser lib dep they do not need will use tika-parsers-optional, look
at the Tika documentation, and add only those specific deps that they need.
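
On the consumer side, that could look roughly like the sketch below. Note that tika-parsers-optional is only the artifact proposed in this message and does not exist, pdfbox is just an example of a parser lib added back, and ${tika.version} and ${pdfbox.version} are placeholder properties.

```xml
<dependencies>
  <!-- Hypothetical pom-only artifact proposed in this thread -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-optional</artifactId>
    <version>${tika.version}</version>
    <type>pom</type>
  </dependency>

  <!-- Add back only the parser lib this application actually needs -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${pdfbox.version}</version>
  </dependency>
</dependencies>
```

The explicitly declared lib brings in its own transitive dependencies as usual, so only the direct parser libs would need to be listed.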

To be honest this seems to be a rather messy approach; having
tika-parsers use optional parser lib dependencies and getting users
to add those libs they actually need (again after looking at the
documentation) is better. This is not that destabilizing to be honest -
any practical application is expected to be aware of the actual file
formats and parser libs supporting those formats.

But I'd like to propose tika-parsers-optional as an alternative; its
advantage is that it can leave all of the existing tika-parsers users in peace...

Thoughts ?

Thanks, Sergey



On 19/06/14 20:22, Nick Burch wrote:
> On Thu, 19 Jun 2014, Ray Gauss wrote:
>> The point of a tika-parsers-all artifact would be a single dependency
>> that re-aggregates everything so that downstream projects could work
>> the same way they do now and not worry about missing dependencies.
>>
>> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
>
> We already have users confused by the current split between tika-core
> and tika-parsers - see users list for example. We already have users
> confused by what dependencies they need with the current poms setup.
> Splitting is going to make that a lot worse. (POI, as a related example,
> sees plenty of confused users who've got mis-matched jars and problems.
> Splitting is going to make that a lot worse.)
>
> We have previously tried pushing parsers out of the tika parser jar and
> into other jars, eg ones maintained by external groups, but on the whole
> it hasn't been a great success. Keeping them in sync, dealing with
> different cycles, applying updates, keeping them consistent, building in
> a sensible length of time, all of that would be harder with a pile of
> modules.
>
> If we were to split out to the level needed by some of the use cases
> mentioned, we'd have so many parser modules it'd be a nightmare to
> maintain, and would cause the problems mentioned above. (People in other
> threads have cautioned on these problems). If we split into just a
> handful of sub modules, then many of the use cases mentioned still have
> to do work to pick out the bits they need.
>
> I still believe that the main use case of tika is "everything included",
> and especially that's the beginner's use case, so I think we should focus
> on keeping that easy. Peeling out just some bits feels like an advanced
> use case to me, so I'd rather we put the requirement for effort onto
> those folks, rather than onto newbies and people on the typical uses.
> I'd therefore much rather we provide advanced docs/help on excluding
> some bits, rather than pull it out into a pile of different modules.
>
> Nick



Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 19 Jun 2014, Ray Gauss wrote:
> The point of a tika-parsers-all artifact would be a single dependency 
> that re-aggregates everything so that downstream projects could work the 
> same way they do now and not worry about missing dependencies.
>
> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?

We already have users confused by the current split between tika-core and 
tika-parsers - see users list for example. We already have users confused 
by what dependencies they need with the current poms setup. Splitting is 
going to make that a lot worse. (POI, as a related example, sees plenty of 
confused users who've got mis-matched jars and problems. Splitting is 
going to make that a lot worse.)

We have previously tried pushing parsers out of the tika parser jar and 
into other jars, eg ones maintained by external groups, but on the whole 
it hasn't been a great success. Keeping them in sync, dealing with 
different cycles, applying updates, keeping them consistent, building in a 
sensible length of time, all of that would be harder with a pile of 
modules.

If we were to split out to the level needed by some of the use cases
mentioned, we'd have so many parser modules it'd be a nightmare to
maintain, and would cause the problems mentioned above. (People in other
threads have cautioned on these problems). If we split into just a handful
of sub modules, then many of the use cases mentioned still have to do
work to pick out the bits they need.

I still believe that the main use case of tika is "everything included",
and especially that's the beginner's use case, so I think we should focus
on keeping that easy. Peeling out just some bits feels like an advanced 
use case to me, so I'd rather we put the requirement for effort onto those 
folks, rather than onto newbies and people on the typical uses. I'd 
therefore much rather we provide advanced docs/help on excluding some 
bits, rather than pull it out into a pile of different modules.

Nick

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Ken Krugler <kk...@transpac.com>.
On Jun 19, 2014, at 2:53am, Sergey Beryozkin <sb...@gmail.com> wrote:

> Hi
> On 19/06/14 01:58, Ray Gauss wrote:
>> The point of a tika-parsers-all artifact would be a single dependency that re-aggregates everything so that downstream projects could work the same way they do now and not worry about missing dependencies.
>> 
>> Meanwhile people that just want PDF parsing could declare only the tika-parser-pdf dependency.
>> 
>> We could go the other way, focusing on exclusions, but as we add more parsers for different types those downstream projects will have to be constantly updating those exclusion lists.
>> 
>> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
>> 
> 
> From what I understand the concern is the proliferation of many new micro modules.
> 
> I wonder whether tika-parsers contains only the specific parser implementations plus some related support modules. If so, then effectively it is already 'tika-parsers-all'.
> 
> If that is the case then I'd settle for documenting the individual dependencies supporting specific file extensions/media types.

That's essentially what I was wondering about when I asked Nick:

>> I'm curious - assuming I only want to parse HTML and PDF (as an example), then what's the right way to ask Maven nicely for what I need to include?

The current approach seems (still) to be:

> * Use tika-app --list-parser-details to find out which class handles
>   the mimetype you want
> * Grep the tika parsers source tree for that class's package, and get
>   the list of imports it makes
> * Explicitly list the artifacts that provide the imports you saw

Unfortunately this is error-prone. There's no real way to know for sure that you have all the required dependent jars.

My approach has been to use Maven to build the dependency graph, then whack the biggest unneeded transitive jars to reduce the footprint of our Hadoop job jar.

-- Ken


>> On June 18, 2014 at 11:39:00 AM, Nick Burch (apache@gagravarr.org) wrote:
>>> On Wed, 18 Jun 2014, Ray Gauss wrote:
>>>> I think for 2.0 we should consider splitting out parsers into their own
>>>> projects for a streamlined dependency hierarchy then reassembling them
>>>> with something like a tika-parsers-all artifact.
>>> 
>>> We had another thread on that not that long ago, where someone cautioned
>>> against breaking it up into too many pieces. We also have fairly frequent
>>> posts on the users list from people who aren't getting any content
>>> returned, because they've forgotten to include a dependency on
>>> tika-parsers
>>> 
>>> I'm not convinced that splitting tika parsers into 20 odd dependencies is
>>> really going to help more than it hinders - more people will get confused
>>> by missing dependencies they really wanted, and anyone with special needs
>>> about what does/doesn't get parsed is probably going to be taking such
>>> care that they can just exclude everything by default anyway and just pull
>>> in what they need. I'd probably rather we just gave an example pom snippet
>>> that shows how to exclude all except one thing, and let people with
>>> special cases work from there.
>>> 
>>> Nick


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi
On 19/06/14 01:58, Ray Gauss wrote:
> The point of a tika-parsers-all artifact would be a single dependency that re-aggregates everything so that downstream projects could work the same way they do now and not worry about missing dependencies.
>
> Meanwhile people that just want PDF parsing could declare only the tika-parser-pdf dependency.
>
> We could go the other way, focusing on exclusions, but as we add more parsers for different types those downstream projects will have to be constantly updating those exclusion lists.
>
> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
>

From what I understand, the concern is the proliferation of many new
micro modules.

I wonder whether tika-parsers contains only the specific parser
implementations plus some related support modules. If so, then
effectively it is already 'tika-parsers-all'.

If that is the case then I'd settle for documenting the individual
dependencies supporting specific file extensions/media types.

Cheers, Sergey

>
>
> On June 18, 2014 at 11:39:00 AM, Nick Burch (apache@gagravarr.org) wrote:
>> On Wed, 18 Jun 2014, Ray Gauss wrote:
>>> I think for 2.0 we should consider splitting out parsers into their own
>>> projects for a streamlined dependency hierarchy then reassembling them
>>> with something like a tika-parsers-all artifact.
>>
>> We had another thread on that not that long ago, where someone cautioned
>> against breaking it up into too many pieces. We also have fairly frequent
>> posts on the users list from people who aren't getting any content
>> returned, because they've forgotten to include a dependency on
>> tika-parsers
>>
>> I'm not convinced that splitting tika parsers into 20 odd dependencies is
>> really going to help more than it hinders - more people will get confused
>> by missing dependencies they really wanted, and anyone with special needs
>> about what does/doesn't get parsed is probably going to be taking such
>> care that they can just exclude everything by default anyway and just pull
>> in what they need. I'd probably rather we just gave an example pom snippet
>> that shows how to exclude all except one thing, and let people with
>> special cases work from there.
>>
>> Nick
>>


Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Ray Gauss <ra...@alfresco.com>.
The point of a tika-parsers-all artifact would be a single dependency that re-aggregates everything so that downstream projects could work the same way they do now and not worry about missing dependencies.

Meanwhile people that just want PDF parsing could declare only the tika-parser-pdf dependency.

We could go the other way, focusing on exclusions, but as we add more parsers for different types those downstream projects will have to be constantly updating those exclusion lists.

What’s the disadvantage for splitting things up (in a 2.0 timeframe)?



On June 18, 2014 at 11:39:00 AM, Nick Burch (apache@gagravarr.org) wrote:
> On Wed, 18 Jun 2014, Ray Gauss wrote:
> > I think for 2.0 we should consider splitting out parsers into their own
> > projects for a streamlined dependency hierarchy then reassembling them
> > with something like a tika-parsers-all artifact.
>  
> We had another thread on that not that long ago, where someone cautioned
> against breaking it up into too many pieces. We also have fairly frequent
> posts on the users list from people who aren't getting any content
> returned, because they've forgotten to include a dependency on
> tika-parsers
>  
> I'm not convinced that splitting tika parsers into 20 odd dependencies is
> really going to help more than it hinders - more people will get confused
> by missing dependencies they really wanted, and anyone with special needs
> about what does/doesn't get parsed is probably going to be taking such
> care that they can just exclude everything by default anyway and just pull
> in what they need. I'd probably rather we just gave an example pom snippet
> that shows how to exclude all except one thing, and let people with
> special cases work from there.
>  
> Nick
>  

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 18 Jun 2014, Ray Gauss wrote:
> I think for 2.0 we should consider splitting out parsers into their own 
> projects for a streamlined dependency hierarchy then reassembling them 
> with something like a tika-parsers-all artifact.

We had another thread on that not that long ago, where someone cautioned 
against breaking it up into too many pieces. We also have fairly frequent 
posts on the users list from people who aren't getting any content 
returned, because they've forgotten to include a dependency on 
tika-parsers

I'm not convinced that splitting tika parsers into 20 odd dependencies is 
really going to help more than it hinders - more people will get confused 
by missing dependencies they really wanted, and anyone with special needs 
about what does/doesn't get parsed is probably going to be taking such 
care that they can just exclude everything by default anyway and just pull 
in what they need. I'd probably rather we just gave an example pom snippet 
that shows how to exclude all except one thing, and let people with 
special cases work from there.
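
A minimal sketch of what such a snippet might look like, assuming Maven 3.2.1 or later (which accepts wildcard exclusions) and an application that only wants PDF support. The ${tika.version} and ${pdfbox.version} properties are placeholders, and pdfbox is picked purely as the "one thing" to keep.

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>${tika.version}</version>
    <exclusions>
      <!-- Wildcard exclusion: drop every transitive dependency -->
      <exclusion>
        <groupId>*</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

  <!-- The wildcard also removes tika-core, so declare it again -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>${tika.version}</version>
  </dependency>

  <!-- The single parser lib that should stay available -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${pdfbox.version}</version>
  </dependency>
</dependencies>
```

Anything else a given parser needs at runtime would have to be added back the same way.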

Nick

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Ray Gauss <ra...@alfresco.com>.
I think for 2.0 we should consider splitting out parsers into their own projects for a streamlined dependency hierarchy then reassembling them with something like a tika-parsers-all artifact.


On June 17, 2014 at 5:08:38 PM, Nick Burch (apache@gagravarr.org) wrote:
> On Tue, 17 Jun 2014, Sergey Beryozkin wrote:
> > The problem seems to be that Tika Parsers module contains many dependencies
> > that may not be needed by a specific custom JAX-RS application.
> >
> > For example, we'd expect a given application dealing with PDF only, or a
> > certain set of image formats only, or word docs only, etc.
> >
> > I'm not sure how many Tika-parsers dependencies are strongly required for any
> > Tika application and which can be made optional.
> 
> Just zap the Tika Parser dependency jars you don't want. All of the Tika
> Parsers should by default silently fail if their dependencies are missing,
> so after that going to /parsers/ you just won't see them there, and if you
> try to parse that kind of document you'll get EmptyParser's result
> instead.
> 
> Nick
> 

Re: Can some of tika-parsers module dependencies be made optional ?

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 17 Jun 2014, Sergey Beryozkin wrote:
> The problem seems to be that Tika Parsers module contains many dependencies 
> that may not be needed by a specific custom JAX-RS application.
>
> For example, we'd expect a given application dealing with PDF only, or a 
> certain set of image formats only, or word docs only, etc.
>
> I'm not sure how many Tika-parsers dependencies are strongly required for any 
> Tika application and which can be made optional.

Just zap the Tika Parser dependency jars you don't want. All of the Tika 
Parsers should by default silently fail if their dependencies are missing, 
so after that going to /parsers/ you just won't see them there, and if you 
try to parse that kind of document you'll get EmptyParser's result 
instead.

Nick