Posted to dev@tika.apache.org by Nick Burch <ni...@apache.org> on 2014/11/23 18:12:14 UTC

Subsets of tika parsers redux

Hi All

During ApacheCon, I had a chance to chat with Sergey about the "subset of 
Tika Parsers" issue that bubbles up from time to time. It seemed to work 
well, and I think we both now have a better idea of the other's needs and 
concerns, which is good :)

As is shown on our list from time to time, but more commonly elsewhere, we 
have some users who are confused already by the split between tika-core 
and tika-parsers. Anything that fragments further is going to cause more 
issues for that kind of user.

On the other hand, there are potential users out there who want just a 
handful of parsers, in a simple and easy and small way, who don't know a 
lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those 
are using OSGi, but not all.

One suggested solution is to just document which dependencies of 
tika-parsers can be excluded at the Maven level to disable certain parsers 
and shrink the resulting dependency tree. However, that requires manual 
updates and manual checking, and, like our examples on the website, risks 
getting out of date without automated checking.

Discussion then turned to our move to get all the examples for the website 
into svn, with unit tests, and having the website pull those from svn on 
the fly to always get the latest tested version.


That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the "full" 
bundle as now, but also have ones for "pdf", "microsoft office", "images" 
etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if 
they only wanted a handful of parsers, or the full one as now.

The smart bit - we have unit tests for these smaller bundles. These unit 
tests ensure that the desired parsers still work on their smaller bundle. 
These unit tests also ensure that unwanted parsers don't work, thus 
flagging up if extra dependencies have snuck through.
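A minimal sketch of that test pattern (the class names below are stand-ins, not real Tika bundle contents): a per-bundle test can probe the classpath and fail if an unwanted dependency resolves.

```java
// Sketch of a per-bundle check: wanted dependencies must resolve,
// unwanted ones must not, so stray jars get flagged.
class BundleSubsetCheck {

    // Returns true if the named class is visible on the classpath.
    static boolean onClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Illustrative stand-ins: a JDK class for a "wanted" dependency,
        // a deliberately absent class for an "unwanted" one.
        System.out.println(onClasspath("java.util.List"));            // true
        System.out.println(onClasspath("org.example.absent.Parser")); // false
    }
}
```

A real bundle test would assert on the actual parser and dependency class names for that bundle.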

Finally, we pull out the includes/excludes information that went into the 
bundle, and display that for non-OSGi users. A non-OSGi person wanting 
"tika with pdf only" could then look at what the tika-pdf-bundle does and 
doesn't use, and from that know which Maven-level dependencies to keep and 
which to exclude.


This new plan would mean having to tweak our build to support multiple 
bundles, and potentially tweaking our bundles so that you could load 
tika-pdf + tika-image and have those two play nicely together. It'd also 
need some new unit tests, and the work to figure out what to 
include/exclude for each of our handful of "common" cases. It should, 
however, deliver a way for OSGi and non-OSGi people to get just a subset 
if that's all they want.
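As a rough sketch of the build tweak (the plugin instructions shown are the Felix maven-bundle-plugin's; the embedded artifact list and bundle name are assumptions, not the actual Tika build), a per-format bundle might look like:

```xml
<!-- Hedged sketch of a "tika-bundle-pdf": embed only the PDF-related jars
     and refuse to import, say, POI packages. Names are illustrative. -->
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <Bundle-SymbolicName>org.apache.tika.bundle.pdf</Bundle-SymbolicName>
      <Embed-Dependency>tika-parsers,pdfbox,fontbox,jempbox</Embed-Dependency>
      <Import-Package>!org.apache.poi.*,*</Import-Package>
    </instructions>
  </configuration>
</plugin>
```

The include/exclude lists in these instructions are exactly the information a non-OSGi user would read off to build their own cut-down Maven dependency set.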

Can anyone see a flaw with this plan? Anyone see a better way? Anyone want 
to help? :)

Nick

Re: Subsets of tika parsers redux

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hey Nick,

This sounds like a great plan to me, good job to you
and Sergey. As for helping I'll try my best, but I'm not
an OSGi guru :)

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++








Re: SVN access issue

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi, I've had to explicitly switch from svn.eu to svn.us for it to start 
working again...
Thanks, Sergey



RE: SVN access issue

Posted by Ken Krugler <kk...@scaleunlimited.com>.

Working for me, just tried.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






SVN access issue

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

"svn up" reports:

Updating '.':
svn: E000111: Unable to connect to a repository at URL 
'https://svn.apache.org/repos/asf/tika/trunk'
svn: E000111: Error running context: Connection refuse

Can it be related to the recent infra-related issue, or is it just a 
temporary problem?

Thanks, Sergey

Re: Translation API question

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tyler, thanks, I'll watch it

Cheers, Sergey

Re: Translation API question

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
One thought I had about this was a TranslatingHandler and/or
a LanguageHandler. That IMO may be the best way to do Language
detection and/or translation in general since that way we could
just easily plug into the output of the existing Parsers, etc.

Else I was thinking about creating a ParserDetector class for
TranslatingParserDecorator and/or LanguageParserDetector to
expose both pieces of information.

Thoughts?
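A minimal sketch of that TranslatingHandler idea (not part of Tika; the translator function here is a stand-in for a real Translator implementation):

```java
import org.xml.sax.helpers.DefaultHandler;
import java.util.function.UnaryOperator;

// Sketch only: decorate the SAX output of a parser and translate each
// text event on the fly, so a downstream consumer (e.g. an indexer)
// sees the translated words as they stream past.
class TranslatingHandler extends DefaultHandler {
    private final StringBuilder collected = new StringBuilder();
    private final UnaryOperator<String> translate; // stand-in for a Translator

    TranslatingHandler(UnaryOperator<String> translate) {
        this.translate = translate;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Translate the text chunk before recording/forwarding it.
        collected.append(translate.apply(new String(ch, start, length)));
    }

    String getText() {
        return collected.toString();
    }
}
```

A parser would be handed this handler instead of a plain one; a toy upper-casing "translator" is enough to show the flow in a test.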



Re: Translation API question

Posted by Tyler Palsulich <tp...@gmail.com>.
Hi Sergey,

Unfortunately, not yet. See TIKA-1328.

Tyler


Translation API question

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi All

Is it possible to submit a document to the Translation API and get the 
translated words as a sequence of events? For example, with the regular 
Tika API it is possible to submit a document and get the metadata and 
the data, and these data can be indexed, etc.

What about submitting a document (for example, in French) to the 
translation API and getting a list of the words in English, so that they 
can be indexed?

I'm thinking that maybe one could then use a query to find all the 
documents in French that contain a given word as it reads in English. 
Example: find a French doc containing "thanks", etc...

Not sure how much sense it makes though :-)

Cheers, Sergey

Re: Subsets of tika parsers redux

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Nick, I think I've actually learned a new urban dictionary word 
mentioned in this thread, 'faff' :-).

On 16/12/14 03:34, Nick Burch wrote:
> On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
>> I'm not proposing to split tika-parsers in a way that would affect the
>> users, tika-parsers would still be there, except that it would
>> strongly depend on tika-pdf and perhaps, when it is being built, it
>> can have its dependencies like tika-pdf shaded in/merged in to ensure
>> a complete backward-compatibility as far as the user expectations of
>> tika-parsers are concerned.
>
> We still have the additional faff of multiple "core" modules, which
> someone warned about in an earlier thread, and additional work for
> developers, and we did try pulling out the pdf parser which didn't work,
> and I'm finding having the Vorbis parsers in a different module + repo
> to be a faff

I was thinking of introducing a very small number of extra modules (at 
the 'expense' of tika-parsers), those covering the mainstream parsers, 
the ones you mentioned earlier, pdf, plus a few others. tika-parsers would 
still be effectively the same after the build, with no side-effects for 
the tika-parsers users. Perhaps it is difficult to realize practically...

>
> My plan doesn't involve any of those problems in phase 1 - core +
> parsers don't change at all, so if it doesn't work we haven't got to
> work hard to undo it, and people not interested aren't affected.
>
> Alternately, if you head back to some of the earlier threads on this,
> and can come up with reasons why the objections raised there can be
> overruled, we could hack up tika parsers. (I'm trying to come up with a
> plan that respects previously raised issues)
>
Sounds good, thanks

I might experiment a bit later on and create a patch for the review but 
I'll take a pause for now
Cheers, Sergey
> Nick


Re: Subsets of tika parsers redux

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
> I'm not proposing to split tika-parsers in a way that would affect the 
> users, tika-parsers would still be there, except that it would strongly 
> depend on tika-pdf and perhaps, when it is being built, it can have its 
> dependencies like tika-pdf shaded in/merged in to ensure a complete 
> backward-compatibility as far as the user expectations of tika-parsers 
> are concerned.

We still have the additional faff of multiple "core" modules, which 
someone warned about in an earlier thread, and additional work for 
developers, and we did try pulling out the pdf parser which didn't work, 
and I'm finding having the Vorbis parsers in a different module + repo to 
be a faff

My plan doesn't involve any of those problems in phase 1 - core + parsers 
don't change at all, so if it doesn't work we haven't got to work hard to 
undo it, and people not interested aren't affected.

Alternately, if you head back to some of the earlier threads on this, and 
can come up with reasons why the objections raised there can be overruled, 
we could hack up tika parsers. (I'm trying to come up with a plan that 
respects previously raised issues)

Nick

Re: Subsets of tika parsers redux

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi,
On 15/12/14 14:28, Nick Burch wrote:
> On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
>>> OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf,
>>> or tika + tika-parsers-pdf + tika-parsers-mp3 if they want
>>>
>>> OSGi is nicely contained, and fairly easy to unit test, so let's use
>>> that to test out the idea! That also solves the CXF need. Once that
>>> works, and once we have a tested way that everyone can see + understand,
>>> then someone can try to make the case for phase II where we push it to
>>> the maven pom / project level!
>>
>> The need of CXF (Tika) users (or of some other users with possibly
>> similar requirements) is not about shipping OSGI only Tika modules but
>> about having an easy option of not having to include all the
>> tika-parsers. Some CXF users would work with OSGI, some not. Sorry if
>> I did not clarify it.
>
> I see us using OSGi as a way to test it, unit test it, and have unit
> tested documentation for moderately advanced maven users. If we just put
> up a page with "this is what we think you might need to exclude and
>> include", it'll almost always be wrong... Saying "OSGi users use this,
> others take the info from a green build of the OSGi module" means we can
> have tested docs!

OK.
>
>> As I said, a module marked as "bundle", as opposed to a default 'jar'
>> is just a plain jar with a few extra META-INF instructions.
>>
>> Given that, I don't understand why you are opposed to having
>> tika-parsers minimized as I suggested? What exactly is your concern?
>
> We have users who get confused by no parsers working when they depend on
> tika-core only. Not so many on the list these days, but loads if you
> look out into the wider internet at other support forums. Those kinds of
> users will only find things worse if the tika parsers get split out.
>
> We also have the massive faff that is maintaining tika parsers outside
> of the tika-parsers module. It seemed a great theory, and we tried it.
> The PDFBox one just didn't get picked up or maintained, never really
> left, and the move was abandoned + main parser reverted to being in
> Tika. I did all the Vorbis parser stuff outside as well, as championed
> by the plan, and it has worked out a lot more work for me than if it'd
> been in Tika itself. So, existing scars are another reason!
>
> (That's why I suggested this as a compromise plan - change nothing for
> normal Java users, until we see if it'll work + be of interest or not.
> If it does work for all, case for the main change already made! If it
> doesn't work, there's nothing to un-do)
>
I'm not proposing to split tika-parsers in a way that would affect the 
users, tika-parsers would still be there, except that it would strongly 
depend on tika-pdf and perhaps, when it is being built, it can have its 
dependencies like tika-pdf shaded in/merged in to ensure a complete 
backward-compatibility as far as the user expectations of tika-parsers 
are concerned.
I think your main concern is that users of tika-parsers could be 
affected.
Would what I just said above work for all? I'm hoping yes, but maybe 
I'm still missing something :-)

Sergey



> Nick



Re: Subsets of tika parsers redux

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
>> OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf,
>> or tika + tika-parsers-pdf + tika-parsers-mp3 if they want
>> 
>> OSGi is nicely contained, and fairly easy to unit test, so let's use
>> that to test out the idea! That also solves the CXF need. Once that
>> works, and once we have a tested way that everyone can see + understand,
>> then someone can try to make the case for phase II where we push it to
>> the maven pom / project level!
>
> The need of CXF (Tika) users (or of some other users with possibly similar 
> requirements) is not about shipping OSGI only Tika modules but about having 
> an easy option of not having to include all the tika-parsers. Some CXF users 
> would work with OSGI, some not. Sorry if I did not clarify it.

I see us using OSGi as a way to test it, unit test it, and have unit 
tested documentation for moderately advanced maven users. If we just put 
up a page with "this is what we think you might need to exclude and 
include", it'll almost always be wrong... Saying "OSGi users use this, 
others take the info from a green build of the OSGi module" means we can 
have tested docs!

> As I said, a module marked as "bundle", as opposed to a default 'jar' is just 
> a plain jar with a few extra META-INF instructions.
>
> Given that, I don't understand why you are opposed to having 
> tika-parsers minimized as I suggested? What exactly is your concern?

We have users who get confused by no parsers working when they depend on 
tika-core only. Not so many on the list these days, but loads if you look 
out into the wider internet at other support forums. Those kinds of users 
will only find things worse if the tika parsers get split out.

We also have the massive faff that is maintaining tika parsers outside of 
the tika-parsers module. It seemed a great theory, and we tried it. The 
PDFBox one just didn't get picked up or maintained, never really left, 
and the move was abandoned + main parser reverted to being in Tika. I did 
all the Vorbis parser stuff outside as well, as championed by the plan, 
and it has worked out a lot more work for me than if it'd been in Tika 
itself. So, existing scars are another reason!

(That's why I suggested this as a compromise plan - change nothing for 
normal Java users, until we see if it'll work + be of interest or not. If 
it does work for all, case for the main change already made! If it doesn't 
work, there's nothing to un-do)

Nick

Re: Subsets of tika parsers redux

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Nick,
On 15/12/14 14:02, Nick Burch wrote:
> On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
>>> As a first step, I thought we'd still keep the same tika-parser jar, the
>>> only difference would be what dependencies ended up in the bundle. If
>>> the tika-bundle-pdf has no POI jars included in it, then the Microsoft
>>> Office related parsers shouldn't register themselves.
>>>
>>> It would mean that the "pdf bundle" would have the image, microsoft etc
>>> parser code in them, but the parsers wouldn't be registered as their
>>> dependencies wouldn't be there.
>>>
>>> Not sure if this can/will work, but it would mean we can do cut-down
>>> bundles + cut-down-maven-docs, without needing to change anything else.
>>> If it proves popular, we can then re-visit the "giant tika parsers"
>>> question, but if not it shouldn't change anything. Well, that's the
>>> theory... :)
>>>
>>
>> Sorry if I haven't completely understood the idea, I think there's
>> definitely something nice being suggested above, and it sounds to me
>> as if the following can be one possible realization of it, as a first
>> step for example,
>> - add a tika-pdf module, this will be a bundle, so it will work as a
>> jar and as an OSGI bundle; the code for tika-pdf will be extracted
>> (and removed) from tika-parsers
>
> Not quite - I foresee this being OSGi only for now. Tika Parsers project
> would be unchanged, OSGi users could have tika (all) as now, or just
> tika-pdf
>
>> - tika-parsers will get updated to depend on tika-pdf - hence users
>> working with tika-parsers won't be affected
>
> No, that's a possible phase 2 if it goes well. No change for non-OSGi
> stuff. Non-OSGi users can see the OSGi build to work out what to include
> and exclude if they want. (This means that we have a unit tested way to
> see what you do/don't want, without affecting things for the simple Tika
> users we get confused already with tika-core + tika-parsers)
>
>> - those users who want to work with PDF only would add tika-core +
>> tika-pdf dependencies only
>
> OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf,
> or tika + tika-parsers-pdf + tika-parsers-mp3 if they want
>
>
> OSGi is nicely contained, and fairly easy to unit test, so let's use
> that to test out the idea! That also solves the CXF need. Once that
> works, and once we have a tested way that everyone can see + understand,
> then someone can try to make the case for phase II where we push it to
> the maven pom / project level!
The need of CXF (Tika) users (or of some other users with possibly 
similar requirements) is not about shipping OSGI only Tika modules but 
about having an easy option of not having to include all the 
tika-parsers. Some CXF users would work with OSGI, some not. Sorry if I 
did not clarify it.

As I said, a module marked as "bundle", as opposed to a default 'jar' is 
just a plain jar with a few extra META-INF instructions. 

Given that, I don't understand why you are opposed to having 
tika-parsers minimized as I suggested? What exactly is your concern?

Shipping something like tika-pdf but still keeping the PDF parsing code 
inside tika-parsers is a duplication, right?

Thanks, Sergey




>
> Nick


Re: Subsets of tika parsers redux

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
>> As a first step, I thought we'd still keep the same tika-parser jar, the
>> only difference would be what dependencies ended up in the bundle. If
>> the tika-bundle-pdf has no POI jars included in it, then the Microsoft
>> Office related parsers shouldn't register themselves.
>> 
>> It would mean that the "pdf bundle" would have the image, microsoft etc
>> parser code in them, but the parsers wouldn't be registered as their
>> dependencies wouldn't be there.
>> 
>> Not sure if this can/will work, but it would mean we can do cut-down
>> bundles + cut-down-maven-docs, without needing to change anything else.
>> If it proves popular, we can then re-visit the "giant tika parsers"
>> question, but if not it shouldn't change anything. Well, that's the
>> theory... :)
>> 
>
> Sorry if I haven't completely understood the idea, I think there's definitely 
> something nice being suggested above, and it sounds to me as if the following 
> can be one possible realization of it, as a first step for example,
> - add a tika-pdf module, this will be a bundle, so it will work as a jar and 
> as an OSGI bundle; the code for tika-pdf will be extracted (and removed) from 
> tika-parsers

Not quite - I foresee this being OSGi only for now. Tika Parsers project 
would be unchanged, OSGi users could have tika (all) as now, or just 
tika-pdf

> - tika-parsers will get updated to depend on tika-pdf - hence users working 
> with tika-parsers won't be affected

No, that's a possible phase 2 if it goes well. No change for non-OSGi 
stuff. Non-OSGi users can see the OSGi build to work out what to include 
and exclude if they want. (This means that we have a unit tested way to 
see what you do/don't want, without affecting things for the simple Tika 
users we get confused already with tika-core + tika-parsers)

> - those users who want to work with PDF only would add tika-core + tika-pdf 
> dependencies only

OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf, or 
tika + tika-parsers-pdf + tika-parsers-mp3 if they want


OSGi is nicely contained, and fairly easy to unit test, so let's use that 
to test out the idea! That also solves the CXF need. Once that works, and 
once we have a tested way that everyone can see + understand, then someone 
can try to make the case for phase II where we push it to the maven pom / 
project level!

Nick

Re: Subsets of tika parsers redux

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Nick

Sorry I haven't responded earlier. Please see a comment below

On 25/11/14 23:11, Nick Burch wrote:
> On Mon, 24 Nov 2014, Sergey Beryozkin wrote:
>> It is an interesting idea, one that can lead to introducing
>> finer-grained bundles but also providing a mechanism for the
>> (auto-)generation of the import metadata required by each of the
>> parser modules. Besides, introducing several smaller bundles that
>> would group most popular formats is a good one on its own IMHO.
>>
>> My doubt here is how many of those bundles we'd need to create and if
>> it will make it easy for users to get a task like "Get a parser for
>> the format A only, or parsers A and B formats only" done.
>
> My hunch, though I've not yet checked if it'll work properly, would be for
> something like half a dozen parsers:
>   * pdf
>   * office
>   * audio / video
>   * html
>   * xml-based (xml, odf, epub, atom)
>   * scientific
>   * everything else
>
>> Are we talking about introducing a parser module per every supported
>> format, and having tika-parsers depend on all of those modules, with
>> every parser module becoming a bundle (a jar plus an entry in the
>> manifest) ?
>
> As a first step, I thought we'd still keep the same tika-parser jar, the
> only difference would be what dependencies ended up in the bundle. If
> the tika-bundle-pdf has no POI jars included in it, then the Microsoft
> Office related parsers shouldn't register themselves.
>
> It would mean that the "pdf bundle" would have the image, microsoft etc
> parser code in them, but the parsers wouldn't be registered as their
> dependencies wouldn't be there.
>
> Not sure if this can/will work, but it would mean we can do cut-down
> bundles + cut-down-maven-docs, without needing to change anything else.
> If it proves popular, we can then re-visit the "giant tika parsers"
> question, but if not it shouldn't change anything. Well, that's the
> theory... :)
>

Sorry if I haven't completely understood the idea, I think there's 
definitely something nice being suggested above, and it sounds to me as 
if the following can be one possible realization of it, as a first step 
for example,
- add a tika-pdf module, this will be a bundle, so it will work as a jar 
and as an OSGI bundle; the code for tika-pdf will be extracted (and 
removed) from tika-parsers
- tika-parsers will get updated to depend on tika-pdf - hence users 
working with tika-parsers won't be affected
- those users who want to work with PDF only would add tika-core + 
tika-pdf dependencies only
- once we see it works we repeat the process for a few more mainstream 
formats as you suggested above (XHTML, audio+video, etc), with 
tika-parsers being gradually minimized but still playing the role of the 
everything-else container

Do you see the above being at least somewhat consistent with what you 
suggested? Would it work?

Cheers, Sergey

> Nick



Re: Subsets of tika parsers redux

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 24 Nov 2014, Sergey Beryozkin wrote:
> It is an interesting idea, one that can lead to introducing finer-grained 
> bundles but also providing a mechanism for the (auto-)generation of the 
> import metadata required by each of the parser modules. Besides, introducing 
> several smaller bundles that would group most popular formats is a good one 
> on its own IMHO.
>
> My doubt here is how many of those bundles we'd need to create, and whether 
> it will make it easy for users to get a task like "get a parser for format A 
> only, or parsers for formats A and B only" done.

My hunch, though I've not yet checked if it'll work properly, would be for 
something like half a dozen bundles:
  * pdf
  * office
  * audio / video
  * html
  * xml-based (xml, odf, epub, atom)
  * scientific
  * everything else

> Are we talking about introducing a parser module per every supported 
> format, and having tika-parsers depend on all of those modules, with 
> every parser module becoming a bundle (a jar plus an entry in the 
> manifest) ?

As a first step, I thought we'd still keep the same tika-parsers jar; the 
only difference would be what dependencies ended up in the bundle. If the 
tika-bundle-pdf has no POI jars included in it, then the Microsoft Office 
related parsers shouldn't register themselves.

It would mean that the "pdf bundle" would have the image, microsoft etc 
parser code in it, but the parsers wouldn't be registered as their 
dependencies wouldn't be there.
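A hedged sketch of that "present but not registered" behaviour: a parser probes for a dependency class at load time and opts out of registration when its supporting jars were excluded from the bundle. The guard method here is illustrative, not actual Tika code, though the POI class name is real:

```java
public class DependencyGuard {
    // Returns true only if the named class can be loaded from the
    // current classpath. A parser could run such a probe to skip
    // registering itself when its supporting jars are absent.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // On a classpath without the POI jars (as in a pdf-only
        // bundle), the Office parsers' probe fails, so they would
        // stay unregistered:
        System.out.println(isAvailable("org.apache.poi.POIDocument"));
        // Core JDK classes are always present:
        System.out.println(isAvailable("java.lang.String"));
    }
}
```

The proposed unit tests for each cut-down bundle could then assert both directions: the wanted parsers' probes succeed, and the unwanted ones' probes fail, flagging any dependency that sneaks back in.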

Not sure if this can/will work, but it would mean we can do cut-down 
bundles + cut-down-maven-docs, without needing to change anything else. If 
it proves popular, we can then re-visit the "giant tika parsers" question, 
but if not it shouldn't change anything. Well, that's the theory... :)

Nick

Re: Subsets of tika parsers redux

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Nick

Was good talking to you and thanks for initiating this thread.

It is an interesting idea, one that could lead to introducing 
finer-grained bundles while also providing a mechanism for the 
(auto-)generation of the import metadata required by each of the parser 
modules. Besides, introducing several smaller bundles that would group 
the most popular formats is a good idea on its own IMHO.

My doubt here is how many of those bundles we'd need to create, and 
whether it will make it easy for users to get a task like "get a parser 
for format A only, or parsers for formats A and B only" done.

Are we talking about introducing a parser module per every supported 
format, and having tika-parsers depend on all of those modules, with 
every parser module becoming a bundle (a jar plus an entry in the 
manifest) ?

Thanks, Sergey


On 23/11/14 17:12, Nick Burch wrote:
> Hi All
>
> During ApacheCon, I had a chance to chat with Sergey about the "subset
> of Tika Parsers" issue that bubbles up from time to time. It seemed to
> work well, and I think we both now have a better idea of the other's
> needs and concerns, which is good :)
>
> As is shown on our list from time to time, but more commonly elsewhere,
> we have some users who are confused already by the split between
> tika-core and tika-parsers. Anything that fragments further is going to
> cause more issues for that kind of user.
>
> On the other hand, there are potential users out there who want just a
> handful of parsers, in a simple and easy and small way, who don't know a
> lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of
> those are using OSGi, but not all.
>
> One suggested solution is to just document what dependencies of
> tika-parsers can be excluded at the maven level to disable certain
> parsers + shrink the resulting dependency tree. However, that requires
> manual updates, manual checking, and like our examples on the website
> risk getting out of date without automated checking.
>
> Discussion then turned to our move to get all the examples for the
> website into svn, with unit tests, and having the website pull those
> from svn on the fly to always get the latest tested version.
>
>
> That led to an idea. Not sure if it'll work yet, but...
>
> What about having multiple Tika OSGi bundles? Continue with the "full"
> bundle as now, but also have ones for "pdf", "microsoft office",
> "images" etc. OSGi users (eg CXF users) could then opt to depend on
> pdf+image if they only wanted a handful of parsers, or the full one as now.
>
> The smart bit - we have unit tests for these smaller bundles. These unit
> tests ensure that the desired parsers still work on their smaller
> bundle. These unit tests also ensure that unwanted parsers don't work,
> thus flagging up if extra dependencies have snuck through.
>
> Finally, we pull out the includes/excludes information that went into
> the bundle, and display that for non-OSGi users. A non-OSGi person
> wanting "tika with pdf only" could then look at what the tika-pdf-bundle
> does and doesn't use, and from that know what maven level dependencies
> to keep and which to exclude.
>
>
> This new plan would mean having to tweak our build to support multiple
> bundles, and potentially tweaking our bundles so that you could load
> tika-pdf + tika-image and have those two play nicely together. It'd also
> need some new unit tests, and the work to figure out what to
> include/exclude for each of our handful of "common" cases. It should,
> however, deliver a way for OSGi and non-OSGi people to get just a subset
> if that's all they want.
>
> Can anyone see a flaw with this plan? Anyone see a better way? Anyone
> want to help? :)
>
> Nick