Posted to dev@tika.apache.org by "Christopher Currie (JIRA)" <ji...@apache.org> on 2011/07/26 03:09:09 UTC

[jira] [Created] (TIKA-686) Split tika-parsers into separate components

Split tika-parsers into separate components
-------------------------------------------

                 Key: TIKA-686
                 URL: https://issues.apache.org/jira/browse/TIKA-686
             Project: Tika
          Issue Type: Wish
          Components: parser
    Affects Versions: 0.9
            Reporter: Christopher Currie
            Priority: Minor


The email thread [1] from two years ago that led to splitting Tika into separate components also suggested splitting tika-parsers into separate components based on dependencies. This would be extremely useful, especially in cases where a given parser has no dependencies beyond tika-core. Please consider refactoring the parsers into separate components for 1.0.

[1] http://markmail.org/message/tavirkqhn6r2szrz

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070923#comment-13070923 ] 

Ken Krugler commented on TIKA-686:
----------------------------------

I'm in favor of anything that helps with avoiding dependencies on POI, if all I want to parse are text-ish formats :)

I assume we could still have a tika-parsers that has all of the parsers, which just has dependencies on all of the tika-parser-xxx components.

Note that there's still the issue of some of Tika's functionality gracefully handling missing components. IIRC, some of Tika's configuration is still driven primarily by data, versus some combination of data plus what's available at run time.


        

[jira] [Resolved] (TIKA-686) Split tika-parsers into separate components

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-686.
--------------------------------

    Resolution: Won't Fix

Resolving as Won't Fix as there's no clear consensus on how to proceed. Let's use the mailing list to discuss this and come back to the issue tracker only once there's a concrete plan of action.
                

        

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

Posted by "Antoni Mylka (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173147#comment-13173147 ] 

Antoni Mylka commented on TIKA-686:
-----------------------------------

Why keep this issue open?

PdfParser appeared in PDFBox (PDFBOX-1132). Keeping both hardly makes sense and has already been identified as a problem (TIKA-810). Pushing parsers upstream covers Ken's "I'm in favor of anything that helps with avoiding dependencies on POI" use case. We agree that we keep the dependency from tika-parsers to POI (doubts about that were dispelled in http://mail-archives.apache.org/mod_mbox/tika-dev/201112.mbox/%3C4EEBA9CA.9030900%40gmail.com%3E). With this dependency in place, it will be possible to use the Maven exclusion construct, exactly as described in my "I like exclusions better" post. So all known use cases are covered.

Since we can't actually remove the PdfParser from Tika now (as that would definitely be a backward-incompatible change), we should deprecate it, remove it from the /META-INF/services/org.apache.tika.parser.Parser file, and replace the implementation with a delegation to the PDFBox version, but that would fall within the scope of TIKA-810.
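
A minimal sketch of the deprecate-and-delegate idea, for illustration only. All names here (TextExtractor, UpstreamPdfExtractor, LegacyPdfExtractor) are hypothetical stand-ins, and the one-method interface is an assumption made to keep the example self-contained; it is not Tika's Parser interface or the PDFBox class.

```java
// Hypothetical stand-in for a parser interface; Tika's real Parser interface
// has a different signature.
interface TextExtractor {
    String extract(byte[] document);
}

// Stand-in for the implementation that now lives in the upstream library.
final class UpstreamPdfExtractor implements TextExtractor {
    @Override
    public String extract(byte[] document) {
        // Placeholder behaviour so the delegation is observable.
        return "upstream:" + document.length;
    }
}

/** @deprecated kept only for backward compatibility; forwards to the upstream class. */
@Deprecated
final class LegacyPdfExtractor implements TextExtractor {
    private final TextExtractor delegate = new UpstreamPdfExtractor();

    @Override
    public String extract(byte[] document) {
        // No parsing logic of its own: every call is forwarded unchanged.
        return delegate.extract(document);
    }
}
```

Existing callers of the old class keep compiling (with a deprecation warning), while the parsing logic is maintained in one place.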

Anyway, this can be closed. The discussion can continue in TIKA-810 and in some new issue for POI.

WDYT?
                

        

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071669#comment-13071669 ] 

Nick Burch commented on TIKA-686:
---------------------------------

I'd personally not be in favour of having lots of Tika parser jars - I think it would make things much more complicated, and lead to confusion when people accidentally miss one out.

Instead, is it not better to have parsers log a warning but then bow out when they can't find their dependencies? That way, if you don't want to parse the Microsoft Office formats, you ditch the POI dependencies, keep the standard Tika parser jar, ignore the warning, and you're away.


        

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

Posted by "Christopher Currie (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071846#comment-13071846 ] 

Christopher Currie commented on TIKA-686:
-----------------------------------------

I admit up front I'm biased toward the dependency management case. From my perspective it's a pain to have to dig into the dependencies and exclude all the ones I don't want.

In the end, I think the key question is "what's the common case?" Is it more common to need a lot of parsers, or just one or two? If it's the former, I think keeping a single jar makes a lot of sense. If it's one or two, then I think having separate jars makes things better, because end-users have a clear path: only care about AutoCAD? Take the DWGParser jar and you're done.

Alternatively, there are other Maven-level options that could be considered that would be an improvement on the current state:

1. Make all of the dependencies of tika-parsers 'optional', except for tika-core. This more closely matches the non-dependency-managed scenario, where the end user is responsible for making sure he or she has all the required dependencies for the parser in question.

2. Create pom-only modules for each parser that pre-document the dependency filter. In other words, for each parser 'foo', create a tika-parser-foo pom that depends on tika-parsers but excludes the dependencies that are not needed by that parser. This saves each end user from the work of figuring out the exclusion list by themselves.

Since I'm making the request, I'm happy to volunteer myself for some of the grunt-work for any of these solutions, if resources are needed to get them done.



        

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072365#comment-13072365 ] 

Nick Burch commented on TIKA-686:
---------------------------------

Does anyone know of a good resource on how imports, method signatures, includes, etc. affect when a missing dependency will trigger a problem?

It's all very well having the Parser constructor try a Class.forName and throw a DependencyMissingException or similar, but if we've done something that means the Parser blows up with a ClassNotFoundException or NoClassDefFoundError before the constructor runs, then that's no help...
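
One conservative way around that ordering problem, sketched here as an assumption rather than as anything Tika does: run the probe from a helper class that has no compile-time reference to the optional library, so the check cannot itself fail to link. DependencyProbe and the class names passed to it are hypothetical.

```java
// Hypothetical helper, not part of Tika: probes for an optional library by class name.
final class DependencyProbe {
    private DependencyProbe() {}

    /** Returns true if the named class can be located and loaded. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: load the class without running its static initializers.
            Class.forName(className, false, DependencyProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // ClassNotFoundException: the class is not on the classpath.
            // LinkageError (e.g. NoClassDefFoundError): the class was found,
            // but something it needs in order to load is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAvailable("java.lang.String"));
        System.out.println(isAvailable("org.example.missing.NoSuchLibrary"));
    }
}
```

A parser could be guarded by a check like this before it is ever instantiated. The probe only confirms that one named class loads; it says nothing about other classes the library may need later.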


        

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071721#comment-13071721 ] 

Jukka Zitting commented on TIKA-686:
------------------------------------

We already did quite a bit of work towards making Tika degrade gracefully when some dependencies are not present, so for now I'd rather encourage people to exclude those dependencies they don't want instead of having to deal with an explosion of dependencies.

My original idea for the Parser interface was that upstream parser libraries could actually implement the interface directly, so that we wouldn't even need any code in tika-parsers. So far we haven't done that too much because the Parser interface was still evolving, but with the AbstractParser class and the proposed cleanup of the Parser interface in 1.0 we should be in a good position to start pushing the Parser implementations upstream.

For example, with POI we could push the entire o.a.tika.parser.microsoft package up to be maintained and included inside POI as something like o.a.poi.tika, either inside one of the existing POI jars (with tika-core as an optional dependency) or as a separate poi-tika jar. Then people could get MS Office support with dependencies on nothing but tika-core and POI. The tika-parsers component would still exist as a composite that mostly just brings together all known Apache-compatible parser implementations.


        

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071769#comment-13071769 ] 

Ken Krugler commented on TIKA-686:
----------------------------------

@Nick - my thought was that we'd have a tika-parsers that had dependencies on all of the parsers, so if you want them all you'd just have to have a dependency on that.

This would be similar to what Jukka talked about, where tika-parsers is a composite that brings all of the individual parsers together.

Though if you're not using a dependency management system, that would make things harder.

@Jukka - what are your concerns about "an explosion of dependencies", if that were the case?

@Jukka - What is your assessment of the current state of affairs in Tika, for gracefully handling missing dependencies? I haven't tracked recent changes, but I thought that we'd run into a new cause of failure when a required library was excluded.






        

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

Posted by "Antoni Mylka (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072328#comment-13072328 ] 

Antoni Mylka commented on TIKA-686:
-----------------------------------

FWIW I would say that fewer is better. 

We (Aperture) tried it and overdid this. Long story short: version 1.4 was split into 73 modules, with 31 external dependencies, builds took forever and day-to-day development work was a pain. It was madness. Clearly, with a bit more common sense it might have worked out better, but the key issue was that nobody wanted this and everyone used a special 'onejar' assembly anyway. 

I don't like optional dependencies. I need lots of XML in my pom to make my app work.

I personally like exclusions better. It's just necessary to make sure that

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-scratchpad</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
    </exclusion>
  </exclusions>
</dependency>

... works without ClassNotFoundErrors. (Aperture throws them in such a case right now).

A solution with pom-only modules for each parser is OK as long as the default case is left as it is. The same problem will have to be solved, though. If I only want Office with POI, then the Tika facade must not initialize the PdfParser even though the class itself is present on the classpath and only its dependencies are missing.
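
A rough sketch of what "must not initialize" could look like, under the assumption that parsers are instantiated reflectively by class name. LenientLoader is a hypothetical name and this is not Tika's actual loading code; it only illustrates skipping a component whose class, or whose dependencies, cannot be loaded.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: instantiate components by class name and skip any
// that cannot be loaded or constructed.
final class LenientLoader {
    private LenientLoader() {}

    static List<Object> loadAvailable(List<String> classNames) {
        List<Object> loaded = new ArrayList<>();
        ClassLoader loader = LenientLoader.class.getClassLoader();
        for (String name : classNames) {
            try {
                Class<?> type = Class.forName(name, true, loader);
                loaded.add(type.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | LinkageError e) {
                // ReflectiveOperationException covers ClassNotFoundException and
                // constructor failures; LinkageError covers NoClassDefFoundError.
                System.err.println("Skipping " + name + ": " + e);
            }
        }
        return loaded;
    }
}
```

Note the limit of this approach: a dependency that is only touched inside a method body may not fail until that method runs, so a loader like this does not catch every case of an excluded library.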
