Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2008/12/04 18:14:17 UTC

XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Hi,

On Thu, Dec 4, 2008 at 10:59 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
> Just one question: Is there interest to do the same tag mapping approach for
> OpenXML (MS Office 2007) files? In my opinion, this is much resource
> friendlier (because it is only extracting text from an XML file) than the
> POI approach of having DOM trees and megabytes of DOM-Tree mappings of the
> OpenXML schema with additional external dependencies.

I agree that directly mapping things from the underlying XML is
probably the most straightforward and easy solution for simple text
extraction.
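
To make this concrete: for a format like OpenDocument, the whole mapping
can be a single SAX handler. Below is only an illustrative sketch (the
class name is invented, and it assumes content.xml has already been
pulled out of the ODF zip), not the actual Tika code:

    import org.xml.sax.helpers.DefaultHandler;

    // Sketch: collect the character data of an ODF content.xml stream,
    // adding a newline whenever a text:p (paragraph) element ends.
    public class OdfTextSketch extends DefaultHandler {
        private static final String TEXT_NS =
            "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
        private final StringBuilder text = new StringBuilder();

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        public void endElement(String uri, String localName, String qName) {
            if (TEXT_NS.equals(uri) && "p".equals(localName)) {
                text.append('\n');
            }
        }

        public String getText() {
            return text.toString();
        }
    }

Feed that to any namespace-aware SAX parser and you are done; no DOM and
no schema mapping involved.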

However, a proper parser library becomes very handy as soon as you
start implementing more complex things like extracting content from
possible attachments or handling encryption. Using an external parser
library also insulates us from a lot of complex details, like users
asking why some content in their documents isn't being extracted.
If we implement parsing inside Tika we also need to take on the burden
of maintaining and supporting that implementation.

In general I'd only implement a parser fully in Tika if the required
amount of code is small (up to a few hundred lines max) and that code
covers all the features we need. The current MP3 parser is a good
example where both requirements are currently satisfied, though if we
want to start supporting some of the more complex MP3 tagging formats
I'd definitely go for an external parser library.

BR,

Jukka Zitting

RE: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Uwe Schindler <us...@pangaea.de>.
Hi,

> On Thu, Dec 4, 2008 at 10:59 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
> > Just one question: Is there interest to do the same tag mapping approach for
> > OpenXML (MS Office 2007) files? In my opinion, this is much resource
> > friendlier (because it is only extracting text from an XML file) than the
> > POI approach of having DOM trees and megabytes of DOM-Tree mappings of the
> > OpenXML schema with additional external dependencies.
> 
> I agree that directly mapping things from the underlying XML is
> probably the most straightforward and easy solution for simple text
> extraction.
> 
> However, a proper parser library becomes very handy as soon as you
> start implementing more complex things like extracting content from
> possible attachments or handling encryption. Using an external parser
> library also insulates us from a lot of complex details, like users
> asking why some content in their documents isn't being extracted.
> If we implement parsing inside Tika we also need to take on the burden
> of maintaining and supporting that implementation.
> 
> In general I'd only implement a parser fully in Tika if the required
> amount of code is small (up to a few hundred lines max) and that code
> covers all the features we need. The current MP3 parser is a good
> example where both requirements are currently satisfied, though if we
> want to start supporting some of the more complex MP3 tagging formats
> I'd definitely go for an external parser library.

I thought about this when writing the OpenDocumentParser for OpenOffice. As
the mapping was very simple for this type of document (just a tag mapping
approach), the code is very short, as you noted. If this is the same with
OpenXML, I would give it a try (but I suspect M$ made it more complicated
than OpenOffice :-). The cool thing about OpenOffice is that all document
types (spreadsheets, text, and presentations) have exactly the same syntax,
which is very convenient. And encryption is not possible (as far as I know),
and signed documents are no problem as it's still XML.

Uwe


RE: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Uwe Schindler <uw...@thetaphi.de>.
> I would be remiss to consider implementing parsers in Tika as it really
> defeats the purpose of the project: that is, to be a bridging middleware
> and
> standard interface to parsing libraries, metadata representation, mime
> extraction frameworks and content analysis mechanisms.

I am OK with this, but I would wish for a simple way to plug parsers in
and out together with their complete dependencies. If you write a project
using only the PDF and HTML parsers, it makes no sense to pollute your
classpath with all these other libraries. If it were possible to correctly
determine which library and parser is needed for which document type, there
should be a way of switching the other parsers completely off (so no
ClassNotFoundExceptions are generated when the auto-detect parser hits an
unsupported document type). A hypothetical sketch of such a guard follows
below.
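
Something as simple as the following would already be enough for my use
case. This is only a sketch of the idea, not an existing Tika API:

    import org.apache.tika.parser.Parser;

    public class ParserGuard {
        // Sketch: instantiate a parser only if its class and its
        // dependency jars are really on the classpath, so that the
        // auto-detect parser can skip that document type instead of
        // throwing ClassNotFoundException at the user.
        public static Parser loadIfAvailable(String parserClassName) {
            try {
                return (Parser) Class.forName(parserClassName).newInstance();
            } catch (ClassNotFoundException e) {
                return null; // the parser jar itself is missing
            } catch (NoClassDefFoundError e) {
                return null; // the parser is there, but a dependency is not
            } catch (Exception e) {
                return null; // could not be instantiated for another reason
            }
        }
    }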

My problem with these highly sophisticated parser libraries outside of Tika
is the classpath pollution. Especially when Tika ships one standalone.jar,
most people will use it (because they do not know how to do it any other
way), but they do not know what is in it. If they know the correct Maven
command, they can use the Tika-only JAR together with the external parser
jars; I had to ask Jukka how to get the original dependency JARs into my
classpath. Without a binary Tika release with all JARs as separate files,
it's not simple to use Tika without the danger of polluting your classpath.
And because it often takes a long time until Tika replaces old dependencies
with newer ones (like the ubiquitous dependencies on XML parsers, JDOM, ...)
and versions conflict, it's a horror to resolve them (especially with the
standalone.jar). For me it was a problem to work with the old nekohtml,
because my projects needed a newer one than the one supplied with Tika.

For me the biggest horror was Tomcat before version 6, which shipped with
very old versions of XML parsers and other Commons tools in one big JAR
file. Since then I have switched completely to lightweight Jetty with only
3 separate JAR files.

Because of this, I wrote the OpenDocument parser. The external parser
library Jukka proposed in one of the issues was not usable, because it was
only a blown-up DOM-tree representation of the XML structure of OpenDocument
files, not usable for any parser without much work.

Another good approach would be this "library" for Tika:
http://xml.openoffice.org/sx2ml/ [for Tika it would only add an XSLT
dependency to the classpath (already shipped with Java 1.5)]. The parsing
would be a pipeline: sx2ml -(SAX events)-> (X)HTML parser -(SAX events)->
text (the extra XHTML stage is there to clean up the heavily style-annotated
HTML). The XSLs could be put into one JAR and loaded with
Class.getResourceAsStream(). If you like that better, I can try to rewrite
the OpenDocument parser using that, because it is an officially supported
part of the OpenOffice community.
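
A rough sketch of what the first stage could look like in code (the
stylesheet resource name below is made up; the real entry point would come
from the sx2ml distribution):

    import java.io.InputStream;
    import java.io.StringWriter;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class Sx2mlSketch {
        // Sketch: transform an ODF content.xml stream to (X)HTML using
        // the sx2ml stylesheets loaded from a JAR on the classpath.
        public static String toHtml(InputStream contentXml) throws Exception {
            InputStream xsl =
                Sx2mlSketch.class.getResourceAsStream("/sx2ml/main.xsl");
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(xsl));
            StringWriter html = new StringWriter();
            t.transform(new StreamSource(contentXml), new StreamResult(html));
            return html.toString();
        }
    }

The resulting HTML would then go through the existing (X)HTML parser to be
reduced to plain text, as described above.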

Hope you understand me.

Uwe


Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Stephane Bastian <st...@gmail.com>.
Hi All,

A lot of very good comments and feedback here.
Don't you think now is a good time to capture these comments in the wiki 
(or somewhere else) and decide to address (or not) the most painful issues?

BR,

Stephane Bastian


Jukka Zitting wrote:
> Hi,
>
> On Mon, Dec 8, 2008 at 10:58 PM, Christopher Corbell
> <ch...@gmail.com> wrote:
>   
>> Unfortunately at this stage in Tika I'm not sure that fundamental changes
>> in basic design are possible.
>>     
>
> I think that most parts of the design are still open at least until
> Tika 1.0 after which we may want to consider adopting a strict
> backwards compatibility policy at least until Tika 2.0.
>
> It might also be good to set the user expectations correctly by
> documenting that there may well be backwards-incompatible changes
> during the 0.x cycle.
>
> Currently I think that the basics of the Parser interface are already
> pretty stable and well vetted, but other parts like configuration,
> packaging, metadata handling, the MIME type registry, etc. could still
> do with more attention before 1.0.
>
>   
>> Perhaps this is the distinction between a "toolkit" and a "framework" -
>> Tika definitely seems more like the former than the latter to me.
>>     
>
> That is pretty much the original vision for Tika, i.e. we'd rather
> create a lightweight toolkit that applications can use as they see fit
> instead of a framework that guides application design.
>
> It will be interesting to see what kind of innovation can be achieved
> on top of Tika, and I would very much welcome discussion about such
> ideas on this list.
>
>   
>> Hopefully this feedback has some constructive use to the community;
>> I've been keeping a lid on these concerns for a while but current threads
>> lured me out.
>>     
>
> Good, more opinions and ideas are always welcome!
>
> BR,
>
> Jukka Zitting
>   


Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Dec 8, 2008 at 10:58 PM, Christopher Corbell
<ch...@gmail.com> wrote:
> Unfortunately at this stage in Tika I'm not sure that fundamental changes
> in basic design are possible.

I think that most parts of the design are still open at least until
Tika 1.0 after which we may want to consider adopting a strict
backwards compatibility policy at least until Tika 2.0.

It might also be good to set the user expectations correctly by
documenting that there may well be backwards-incompatible changes
during the 0.x cycle.

Currently I think that the basics of the Parser interface are already
pretty stable and well vetted, but other parts like configuration,
packaging, metadata handling, the MIME type registry, etc. could still
do with more attention before 1.0.

> Perhaps this is the distinction between a "toolkit" and a "framework" -
> Tika definitely seems more like the former than the latter to me.

That is pretty much the original vision for Tika, i.e. we'd rather
create a lightweight toolkit that applications can use as they see fit
instead of a framework that guides application design.

It will be interesting to see what kind of innovation can be achieved
on top of Tika, and I would very much welcome discussion about such
ideas on this list.

> Hopefully this feedback has some constructive use to the community;
> I've been keeping a lid on these concerns for a while but current threads
> lured me out.

Good, more opinions and ideas are always welcome!

BR,

Jukka Zitting

Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Niall Pemberton <ni...@gmail.com>.
On Mon, Dec 8, 2008 at 9:58 PM, Christopher Corbell
<ch...@gmail.com> wrote:
> Just to add my lurker's thoughts to this thread, for what it's worth...
>
> Nearly all of the issues raised in this thread (and in the other one I've
> been following on Dublin Core) are to me appropriate to a "middlewarish"
> metadata framework.  Some of the corners that folks are pushing are what I
> hoped to find when I started playing with Tika, and I'm glad to see the
> discussion, but I'm a little pessimistic at the same time.
>
> In general I feel a metadata framework should support both a handy default
> configuration with ready parsers for folks just doing a quick-and-dirty CM
> system, and it should support very granular and resource-efficient
> configuration of exactly the parsers you need - and I'd even go further and
> say there should be a common interface to configure individual parsers to
> only get the individual metadata keys that you need.
>
> I think the framework should have its own set of project-managed parsers to
> be extended as the project matures, and facilities to wrap external parsers;
> it isn't complete if it excludes either, but any user's configuration should
> be able to include or exclude whatever parsers you like easily.
>
> The framework I need should negotiate, report and help avoid collisions in
> both standard (e.g. Dublin-core) and vendor-defined metadata keys.  It
> should support use of a downstream parser to "override" as well as extend
> behavior of an upstream parser.  It should support user-defined (not just
> parser-defined) namespaces to permit more than one parser to process the
> same file without overwriting each other's data for commonly named keys. It
> should also support "synthetic" parsers that take upstream metadata and
> synthesize new metadata, or perhaps simply inject user-defined metadata such
> as keywords, processing date, an expiration date or similar.  All of these
> requirements are implemented in a DAM product line I've worked on and are
> driven by modifiability use cases from the real world.
>
> The middlewarish metadata framework should have a story for establishing
> multiple distinct metadata-parsing "engine" instances so that for example a
> single CM or DAM system could supply instances specific to different
> organizational departments or workflows; for example Creative might need a
> completely different set of metadata to search on for a parsed PDF than
> Legal would need, but the assets are being stored in the same ECM system.
> It's also not uncommon for a customer in a DAM workflow to set up a specific
> set of parsing preferences for a -single- batch of files to be processed.
>
> Finally, it should be an object-oriented interface which hands your Java
> code an Object (some simple Map of key-values) that can then readily be
> converted to XML (preferably via JAXB) or whatever else is needed, possibly
> with some optional framework-supplied transformations downstream from this
> purely structural XML to other formats.  In other words, XHTML should be an
> option for transformed output for those who need XHTML; it should not be the
> default output of the framework.
>
> All of this to me points to basic design and architecture issues, not to
> incremental improvement or enhancement.  Unfortunately at this stage in Tika
> I'm not sure that fundamental changes in basic design are possible. As
> stated in "How the ASF works"
> <http://www.apache.org/foundation/how-it-works.html#incubator>,
> "the friction that is developed during the initial design stage is likely to
> fragment the community."  That's probably also true if one were to propose a
> major non-initial redesign stage.
>
> So I'm not sure it's possible to radically change Tika design at this point
> to meet my needs; more likely I'll use it opportunistically to find parsers
> I don't know about or just to steal a bit of code here and there.  Perhaps
> this is the distinction between a "toolkit" and a "framework" - Tika
> definitely seems more like the former than the latter to me.  But maybe
> others have a clearer vision of how to do things like this with an evolving
> Tika.
>
> Also perhaps others on the list are happy with the use cases that Tika
> currently satisfies; I don't mean to slight the project - I'm sure it's
> meeting the needs of many.  Hopefully this feedback has some constructive
> use to the community; I've been keeping a lid on these concerns for a while
> but current threads lured me out.

Nothing ever changes without people stepping forward to first
propose/discuss and then following up with actual contributions - so
if you give up without trying, then your pessimistic outcome is
assured. Perhaps you're right and your proposals won't be accepted -
but give it a try at least. I would suggest picking one concrete
proposal - discuss it first, but be prepared to back it up with
code/patches - and see how that goes.

Niall (fellow lurker)

> - Chris
>

Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Christopher Corbell <ch...@gmail.com>.
Just to add my lurker's thoughts to this thread, for what it's worth...

Nearly all of the issues raised in this thread (and in the other one I've
been following on Dublin Core) are to me appropriate to a "middlewarish"
metadata framework.  Some of the corners that folks are pushing are what I
hoped to find when I started playing with Tika, and I'm glad to see the
discussion, but I'm a little pessimistic at the same time.

In general I feel a metadata framework should support both a handy default
configuration with ready parsers for folks just doing a quick-and-dirty CM
system, and it should support very granular and resource-efficient
configuration of exactly the parsers you need - and I'd even go further and
say there should be a common interface to configure individual parsers to
only get the individual metadata keys that you need.

I think the framework should have its own set of project-managed parsers to
be extended as the project matures, and facilities to wrap external parsers;
it isn't complete if it excludes either, but any user's configuration should
be able to include or exclude whatever parsers you like easily.

The framework I need should negotiate, report and help avoid collisions in
both standard (e.g. Dublin-core) and vendor-defined metadata keys.  It
should support use of a downstream parser to "override" as well as extend
behavior of an upstream parser.  It should support user-defined (not just
parser-defined) namespaces to permit more than one parser to process the
same file without overwriting each other's data for commonly named keys. It
should also support "synthetic" parsers that take upstream metadata and
synthesize new metadata, or perhaps simply inject user-defined metadata such
as keywords, processing date, an expiration date or similar.  All of these
requirements are implemented in a DAM product line I've worked on and are
driven by modifiability use cases from the real world.

The middlewarish metadata framework should have a story for establishing
multiple distinct metadata-parsing "engine" instances so that for example a
single CM or DAM system could supply instances specific to different
organizational departments or workflows; for example Creative might need a
completely different set of metadata to search on for a parsed PDF than
Legal would need, but the assets are being stored in the same ECM system.
It's also not uncommon for a customer in a DAM workflow to set up a specific
set of parsing preferences for a -single- batch of files to be processed.

Finally, it should be an object-oriented interface which hands your Java
code an Object (some simple Map of key-values) that can then readily be
converted to XML (preferably via JAXB) or whatever else is needed, possibly
with some optional framework-supplied transformations downstream from this
purely structural XML to other formats.  In other words, XHTML should be an
option for transformed output for those who need XHTML; it should not be the
default output of the framework.
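
To illustrate what I mean, here is a toy sketch (all names are invented;
nothing here is existing Tika or product code) of a key-value result
object that JAXB can marshal into purely structural XML:

    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.annotation.XmlAttribute;
    import javax.xml.bind.annotation.XmlElement;
    import javax.xml.bind.annotation.XmlRootElement;
    import javax.xml.bind.annotation.XmlValue;

    @XmlRootElement(name = "metadata")
    public class MetadataResult {

        public static class Entry {
            @XmlAttribute public String key;
            @XmlValue public String value;
        }

        @XmlElement(name = "entry")
        public List<Entry> entries = new ArrayList<Entry>();

        public void put(String key, String value) {
            Entry e = new Entry();
            e.key = key;
            e.value = value;
            entries.add(e);
        }
    }

Marshalling an instance with JAXBContext.newInstance(MetadataResult.class)
yields something like <metadata><entry key="dc:title">...</entry></metadata>,
which a downstream transformation can turn into XHTML for those who want it.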

All of this to me points to basic design and architecture issues, not to
incremental improvement or enhancement.  Unfortunately at this stage in Tika
I'm not sure that fundamental changes in basic design are possible. As
stated in "How the ASF works"
<http://www.apache.org/foundation/how-it-works.html#incubator>,
"the friction that is developed during the initial design stage is likely to
fragment the community."  That's probably also true if one were to propose a
major non-initial redesign stage.

So I'm not sure it's possible to radically change Tika design at this point
to meet my needs; more likely I'll use it opportunistically to find parsers
I don't know about or just to steal a bit of code here and there.  Perhaps
this is the distinction between a "toolkit" and a "framework" - Tika
definitely seems more like the former than the latter to me.  But maybe
others have a clearer vision of how to do things like this with an evolving
Tika.

Also perhaps others on the list are happy with the use cases that Tika
currently satisfies; I don't mean to slight the project - I'm sure it's
meeting the needs of many.  Hopefully this feedback has some constructive
use to the community; I've been keeping a lid on these concerns for a while
but current threads lured me out.

- Chris

Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Dec 8, 2008 at 2:05 PM, Nadav Har'El <ny...@math.technion.ac.il> wrote:
> Yes, I admitted that it is a scary idea, but in the long run, what *will*
> the Tika developers do if indeed there is a bug in a specific PDF construct?
> Hope that the PDFbox developers fix it?

Yes. We file an issue at PDFBox and upgrade to the next release that
fixes that problem. If we actually *know* how to fix the problem then
we attach the patch to that issue.

If the PDFBox developers are unresponsive, then we start looking for
some other PDF parser library that better meets our needs.

Forking PDFBox is IMHO only the very last option after the above
alternatives are exhausted. And even then I wouldn't move the code to
Tika, but rather start a new project where people interested in PDF
(as opposed to generic text extraction) can come together and work on
the code.

> If indeed such a symbiosis is possible, then it will indeed be great because
> the size issue becomes moot (although others like the "dll hell" mentioned
> earlier don't). I just wonder if we have any power to influence these
> other projects.

Most projects I know are eager to welcome people with good ideas and
patches. The basic principle of open source development is that you
influence projects by contributing to them. I don't see any parser
project rejecting a proposal like "it would be helpful for Tika if you
adopted this patch I've prepared" unless the patch itself is crappy.

BR,

Jukka Zitting

Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Mon, Dec 08, 2008, Jukka Zitting wrote about "Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)":
> As mentioned before, I don't think that's a good idea as we don't have
> the required format-specific expertise here and IMHO trying to gather
> such expertise into a single project and community would be a futile
> exercise.
> 
> I really don't want to start dealing with detailed questions about why
> this specific PDF construct or Office XML feature is not supported by
> Tika. It's an ocean of trouble that we're in no way equipped to
> handle.

Yes, I admitted that it is a scary idea, but in the long run, what *will*
the Tika developers do if indeed there is a bug in a specific PDF construct?
Hope that the PDFBox developers fix it?

> > I would be even happier if those projects made the separation
> > themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which is
> > exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am
> > doubtful that this will happen on its own any time soon.
> 
> This sounds like a much more fertile approach and I don't think it's
> all that far fetched. Now with Tika we have a clear rationale why such
> a trimmed down component would be useful, and I think many parser
> libraries would be happy to consider such proposals.
> 
> I'm currently mentoring the PDFBox project at the Apache Incubator,
> and I think the project would respond really well if someone came up
> there with a proposal of generating such an extra pdfbox-extract
> release artifact.

This sounds like a really interesting idea.

In a desktop search engine I once wrote, PDFBox was *half* of the entire
installer size, as big as the sum of Lucene, the search application, the
web server and servlet engine, and a dozen other parsers and libraries.
From looking around the PDFBox code I got the impression that text
extraction was only a small part of the code, but it was almost impossible
to separate it from the rest (I managed to pull out some of the stuff, but
not all of it). I would love to see such a separation made easier. But
please note that I'm not merely suggesting separating PDF *reading* and
*writing* - it is more than that - even *reading* PDF involves more than
just extracting text, and we don't need all of that in Tika. The biggest
thing we don't need in Tika, for example, is the font handling code.

> We can and should work together with the parser projects to address
> the requirements we see. That's a much better alternative than forking
> parts of those projects inside Tika.

If indeed such a symbiosis is possible, then it will indeed be great because
the size issue becomes moot (although others like the "dll hell" mentioned
earlier don't). I just wonder if we have any power to influence these
other projects.

Thanks,
Nadav.

-- 
Nadav Har'El                        |      Monday, Dec  8 2008, 11 Kislev 5769
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Wear short sleeves! Support your right to
http://nadav.harel.org.il           |bare arms!

Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Jukka,

>
> We can and should work together with the parser projects to address
> the requirements we see. That's a much better alternative than forking
> parts of those projects inside Tika.

+1. This was the decision, as I mentioned, after a number of years of seeing
parser explosion and forking in Nutch. The idea is, all of these communities
do have their own expertise and code bases. Tika is like the common driver
(or "middleware") to each.

Cheers,
Chris

>
> BR,
>
> Jukka Zitting
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Dec 8, 2008 at 8:44 AM, Nadav Har'El <ny...@math.technion.ac.il> wrote:
> I don't think this is a pipe-dream - I believe (but correct me if I got the
> wrong impression...) that only a minority of the code in PDFBox and POI for
> example is relevant at all to Tika, and that Tika could do better by copying
> only the relevant parts of the code rather than using the whole code as
> a black box.

As mentioned before, I don't think that's a good idea as we don't have
the required format-specific expertise here and IMHO trying to gather
such expertise into a single project and community would be a futile
exercise.

I really don't want to start dealing with detailed questions about why
this specific PDF construct or Office XML feature is not supported by
Tika. It's an ocean of trouble that we're in no way equipped to
handle.

> I would be even happier if those projects made the separation
> themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which is
> exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am
> doubtful that this will happen on its own any time soon.

This sounds like a much more fertile approach and I don't think it's
all that far fetched. Now with Tika we have a clear rationale why such
a trimmed down component would be useful, and I think many parser
libraries would be happy to consider such proposals.

I'm currently mentoring the PDFBox project at the Apache Incubator,
and I think the project would respond really well if someone came up
there with a proposal of generating such an extra pdfbox-extract
release artifact.

We can and should work together with the parser projects to address
the requirements we see. That's a much better alternative than forking
parts of those projects inside Tika.

BR,

Jukka Zitting

RE: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hello Stephane,

> If the size of the dependencies is the main concern, we could use the
> minijar maven plugin (http://mojo.codehaus.org/minijar-maven-plugin/)
> which creates a mini version of the dependencies used by Tika. The plugin
> analyses classes and only keeps the ones actually used by Tika. If we
> need something more powerful we could then use Proguard:
> http://proguard.sourceforge.net/

But then you still have the problem of polluting the classpath with class
files you cannot control. If, for example, a project uses the minified
standalone Tika JAR and also needs one of those dependencies itself (e.g.
nekohtml for its own development), but needs more class files than the
bundle contains (because it uses some extra feature of nekohtml), you have
a problem. The project may then choose to add nekohtml to its own lib path,
maybe in a newer version. Depending on the position of the Tika standalone
JAR file in the classpath, you get one of two symptoms: it fails completely,
because the few classes from Tika mix with the newer, complete nekohtml and
create incompatibilities (Tika comes before the project classes, and
internal implementations may not stay consistent across versions), or it
works (Tika comes at the end). If you have more such dependencies in your
project, it is hard to figure out what is going on. The effect is easy to
demonstrate on the command line, as the example below shows.
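
On the command line the difference is literally just the classpath order
(the jar names and version here are only for illustration):

    # fails: the minified nekohtml classes inside tika-standalone.jar
    # shadow the complete nekohtml that the project needs
    java -cp tika-standalone.jar:nekohtml-1.9.9.jar:myproject.jar Main

    # works: the complete nekohtml comes first and wins
    java -cp nekohtml-1.9.9.jar:tika-standalone.jar:myproject.jar Main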

In my opinion, third-party JARs should *always* be kept in their original
JAR files, with the original names and version numbers. If you really want
to have something like tika-standalone.jar for the command line interface,
clearly note to end users that it is *not* for inclusion in projects and
that the JAR file is *only* for the CLI version (see my other mail
mentioning my Tomcat < 6 problems).

If somebody uses Tika as a parser plugin in his own Java developments, he
can scan the supplied JAR files (if they are, hopefully, supplied in the
binary release soon) and choose the ones he needs. For that it would be
good to have a map somewhere: if you want to support this or that document
format, you need JAR files a, b, c. A toy sketch of what such a map could
look like follows below.
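
The toy sketch, with jar names taken from your minijar output (the file
itself is invented; nothing like it exists in Tika today):

    # parser-dependencies.properties (hypothetical)
    application/pdf          = pdfbox-0.7.3.jar, fontbox-0.1.0.jar, jempbox-0.2.0.jar
    text/html                = nekohtml-1.9.9.jar, xercesImpl-2.8.1.jar
    application/vnd.ms-excel = poi-3.1-FINAL.jar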

If he wants to minimize them, he can use the tools you proposed. I think
this is the way to go.

Uwe


Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Mon, Dec 08, 2008, Stephane Bastian wrote about "Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)":
> If the size of the dependencies is the main concern, we could use the 
> minijar maven plugin (http://mojo.codehaus.org/minijar-maven-plugin/) 

This is a very interesting idea. Thanks! I wasn't aware of this tool.
Maybe I'll use it for some of my projects :-)

However, I fear that this technique underestimates by a long shot the
possible savings in the case of Tika. The most important example is:

> Original length of pdfbox-0.7.3.jar was 3321762 bytes. Was able shrink 
> it to pdfbox-0.7.3-mini.jar at 3076844 bytes (92%)

If you have ever looked at the PDFBox code, it is evident (at least I think
it is...) that Tika needs far less than 92% of the code. PDFBox can create
PDF files, modify them, render them, fill forms, and offers countless other
features that are completely irrelevant to Tika. The problem is that PDFBox
is written in a way that even when you just want to extract text, all sorts
of irrelevant code (like font handling) get pulled in. I once tried manually
to reduce the size of PDFBox's jar by removing what I thought was
irrelevant, and only partially succeeded because the code was so entangled.

Anyway, the size of the dependencies is indeed a major concern, but like I
said, it isn't the only concern (I mentioned a couple of others in my post).
Uwe brought up another interesting concern today: the sheer number of
dependencies and their versions makes them very hard to use correctly. If
Tika depends on a dozen libraries, it will probably require you to get
library A version X, library B version Y, and so on, and if you get one of
these versions wrong it might fail in mysterious ways.

Nadav.

-- 
Nadav Har'El                        |      Monday, Dec  8 2008, 11 Kislev 5769
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Red meat is not bad for you: fuzzy green
http://nadav.harel.org.il           |meat is bad for you.

Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Stephane Bastian <st...@gmail.com>.
Hi,

If the size of the dependencies is the main concern, we could use the 
minijar maven plugin (http://mojo.codehaus.org/minijar-maven-plugin/) 
which creates a mini version of the dependencies used by Tika. The plugin
analyses classes and only keeps the ones actually used by Tika. If we 
need something more powerful we could then use Proguard: 
http://proguard.sourceforge.net/
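
For anyone who wants to reproduce this, the pom.xml change is roughly the
following (a sketch; see the plugin page above for the exact goal names
and options):

    <build>
      <plugins>
        <plugin>
          <groupId>org.codehaus.mojo</groupId>
          <artifactId>minijar-maven-plugin</artifactId>
        </plugin>
      </plugins>
    </build>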

I modified pom.xml and ran the minijar plugin. Here are the results if
you're interested:

Can remove 3969 of 5601 classes (70%).
Original length of commons-io-1.4.jar was 109043 bytes. Was able shrink 
it to commons-io-1.4-mini.jar at 19022 bytes (17%)
Original length of poi-3.1-FINAL.jar was 1344956 bytes. Was able shrink 
it to poi-3.1-FINAL-mini.jar at 852767 bytes (63%)
Original length of log4j-1.2.14.jar was 367444 bytes. Was able shrink it 
to log4j-1.2.14-mini.jar at 98397 bytes (26%)
Original length of poi-scratchpad-3.1-FINAL.jar was 721469 bytes. Was 
able shrink it to poi-scratchpad-3.1-FINAL-mini.jar at 469216 bytes (65%)
Original length of commons-logging-1.0.4.jar was 38015 bytes. Was able 
shrink it to commons-logging-1.0.4-mini.jar at 15839 bytes (41%)
Original length of commons-lang-2.1.jar was 207723 bytes. Was able 
shrink it to commons-lang-2.1-mini.jar at 106922 bytes (51%)
Original length of bcmail-jdk14-136.jar was 189233 bytes. Was able 
shrink it to bcmail-jdk14-136-mini.jar at 10639 bytes (5%)
Original length of jempbox-0.2.0.jar was 42669 bytes. Was able shrink it 
to jempbox-0.2.0-mini.jar at 234 bytes (0%)
No references to jar icu4j-3.8.jar. You can safely omit that dependency.
Original length of bcprov-jdk14-136.jar was 1401560 bytes. Was able 
shrink it to bcprov-jdk14-136-mini.jar at 142698 bytes (10%)
Original length of xml-apis-1.3.03.jar was 195119 bytes. Was able shrink 
it to xml-apis-1.3.03-mini.jar at 64806 bytes (33%)
Original length of fontbox-0.1.0.jar was 63159 bytes. Was able shrink it 
to fontbox-0.1.0-mini.jar at 61322 bytes (97%)
Original length of commons-codec-1.3.jar was 46725 bytes. Was able 
shrink it to commons-codec-1.3-mini.jar at 12439 bytes (26%)
Original length of asm-3.1.jar was 43033 bytes. Was able shrink it to 
asm-3.1-mini.jar at 39590 bytes (91%)
Original length of xercesImpl-2.8.1.jar was 1212965 bytes. Was able 
shrink it to xercesImpl-2.8.1-mini.jar at 101989 bytes (8%)
Original length of nekohtml-1.9.9.jar was 116162 bytes. Was able shrink 
it to nekohtml-1.9.9-mini.jar at 92625 bytes (79%)
Original length of pdfbox-0.7.3.jar was 3321762 bytes. Was able shrink 
it to pdfbox-0.7.3-mini.jar at 3076844 bytes (92%)

BR,

Stephane

Nadav Har'El wrote:
> I know my following question and proposal is a bit heretical, but I think
> it needs to be raised nonetheless...
>
> On Sun, Dec 07, 2008, Mattmann, Chris A wrote about "Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)":
>   
>> I would be remiss to consider implementing parsers in Tika as it really
>> defeats the purpose of the project: that is, to be a bridging middleware and
>> standard interface to parsing libraries, metadata representation, mime
>> extraction frameworks and content analysis mechanisms.
>>     
>
> The more I think about this issue, the less I am convinced that Tika as a
> wrapper for a dozen other libraries is the right way to go in the long run.
> We are facing several problems:
>
> 1. The big libraries such as PDFBox, POI, etc. often do much more than Tika
>    wants to do - they can understand the document fonts, layout, graphics
>    and other things Tika doesn't care about, and can create new documents or
>    modify existing documents. As a result, these libraries are very complex,
>    and very large - heck, each of PDFBox and POI is larger than the whole of
>    Lucene (Tika's parent project)! This is a problem for applications that
>    need text extraction but wish to remain small or at least not huge
>    (think: desktop search, mobile devices, etc.).
>
> 2. There are many formats that will need new code anyway, such as OpenOffice,
>    MP3, and other parsers which Tika already has code for. Very soon (if
>    it is not true already), most of the parsers will be in Tika's code
>    anyway, and not in external projects.
>
> 3. It is hard to enforce behavior standards and policies on the individual
>    libraries. Each library makes its own decision whether it works by
>    streaming or allocating the whole file, how it handles memory constraints
>    (to avoid OOM exceptions which are catastrophic to the application), what
>    kind of metadata it knows how to extract, and so on. This makes it harder
>    for Tika to give consistent behavior.
>
> What I'd really love to see materialize is a relatively small stand-alone
> library (500K sounds a reasonable figure ;-)), which can extract text and a
> bit of metadata from all sorts of file formats, but nothing more. I would
> love to see Tika become this sort of library.
>
> I don't think this is a pipe-dream - I believe (but correct me if I got the
> wrong impression...) that only a minority of the code in PDFBox and POI for
> example is relevant at all to Tika, and that Tika could do better by copying
> only the relevant parts of the code rather than using the whole code as
> a black box. I would be even happier if those projects made the separation
> themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which is
> exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am
> doubtful that this will happen on its own any time soon.
>
> What I am suggesting may sound scary at first, but I think that in the long
> run, this is the right way to go forward.
>
> Nadav.
>
>   


Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
I know my following question and proposal is a bit heretical, but I think
it needs to be raised nonetheless...

On Sun, Dec 07, 2008, Mattmann, Chris A wrote about "Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)":
> I would be remiss to consider implementing parsers in Tika as it really
> defeats the purpose of the project: that is, to be a bridging middleware and
> standard interface to parsing libraries, metadata representation, mime
> extraction frameworks and content analysis mechanisms.

The more I think about this issue, the less I am convinced that Tika as a
wrapper for a dozen other libraries is the right way to go in the long run.
We are facing several problems:

1. The big libraries such as PDFBox, POI, etc. often do much more than Tika
   wants to do - they can understand the document fonts, layout, graphics
   and other things Tika doesn't care about, and can create new documents or
   modify existing documents. As a result, these libraries are very complex,
   and very large - heck, each of PDFBox and POI is larger than the whole of
   Lucene (Tika's parent project)! This is a problem for applications that
   need text extraction but wish to remain small or at least not huge
   (think: desktop search, mobile devices, etc.).

2. There are many formats that will need new code anyway, such as OpenOffice,
   MP3, and other parsers which Tika already has code for. Very soon (if
   it is not true already), most of the parsers will be in Tika's code
   anyway, and not in external projects.

3. It is hard to enforce behavior standards and policies on the individual
   libraries. Each library makes its own decision whether it works by
   streaming or allocating the whole file, how it handles memory constraints
   (to avoid OOM exceptions which are catastrophic to the application), what
   kind of metadata it knows how to extract, and so on. This makes it harder
   for Tika to give consistent behavior.

What I'd really love to see materialize is a relatively small stand-alone
library (500K sounds a reasonable figure ;-)), which can extract text and a
bit of metadata from all sorts of file formats, but nothing more. I would
love to see Tika become this sort of library.

I don't think this is a pipe-dream - I believe (but correct me if I got the
wrong impression...) that only a minority of the code in PDFBox and POI for
example is relevant at all to Tika, and that Tika could do better by copying
only the relevant parts of the code rather than using the whole code as
a black box. I would be even happier if those projects made the separation
themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which is
exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am
doubtful that this will happen on its own any time soon.

What I am suggesting may sound scary at first, but I think that in the long
run, this is the right way to go forward.

Nadav.

-- 
Nadav Har'El                        |      Monday, Dec  8 2008, 11 Kislev 5769
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I have a great signature, but it won't
http://nadav.harel.org.il           |fit at the end of this message -- Fermat

Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
I'd like to add my 2 cents to this. The whole idea of Tika actually grew out
of the fact that we tried this already in Nutch, and we realized that
implementing your own parser libraries and glue code/interfaces:

1. was not developer-friendly: the maintainers of these parsing libraries
typically like to branch off and work on the intricacies of maintaining the
parser specific code in their own forum, which makes it difficult to keep
the current parsing code in your system up to date.

2. was a hindrance to reuse: we had different versions and forks of existing
parser libraries (e.g., see parse-rss) which fell out of date with the
externally developed versions

3. did not scale

I would be remiss to consider implementing parsers in Tika as it really
defeats the purpose of the project: that is, to be a bridging middleware and
standard interface to parsing libraries, metadata representation, mime
extraction frameworks and content analysis mechanisms.

Thanks!

Cheers,
Chris



On 12/4/08 9:14 AM, "Jukka Zitting" <ju...@gmail.com> wrote:

> Hi,
>
> On Thu, Dec 4, 2008 at 10:59 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
>> Just one question: Is there interest to do the same tag mapping approach for
>> OpenXML (MS Office 2007) files? In my opinion, this is much resource
>> friendlier (because it is only extracting text from an XML file) than the
>> POI approach of having DOM trees and megabytes of DOM-Tree mappings of the
>> OpenXML schema with additional external dependencies.
>
> I agree that directly mapping things from the underlying XML is
> probably the most straightforward and easy solution for simple text
> extraction.
>
> However, a proper parser library becomes very handy as soon as you
> start implementing more complex things like extracting content from
> possible attachments or handling encryption. Using an external parser
> library also insulates us from a lot of complex details, like users
> asking why some content in their documents isn't being extracted.
> If we implement parsing inside Tika we also need to take on the burden
> of maintaining and supporting that implementation.
>
> In general I'd only implement a parser fully in Tika if the required
> amount of code is small (up to a few hundred lines max) and that code
> covers all the features we need. The current MP3 parser is a good
> example where both requirements are currently satisfied, though if we
> want to start supporting some of the more complex MP3 tagging formats
> I'd definitely go for an external parser library.
>
> BR,
>
> Jukka Zitting
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.