Posted to dev@jackrabbit.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2011/03/03 16:26:36 UTC
[jira] Commented: (JCR-2885) Move tika-parsers dependency to deployment packages
[ https://issues.apache.org/jira/browse/JCR-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002021#comment-13002021 ]
Jukka Zitting commented on JCR-2885:
------------------------------------
I moved the tika-parsers dependency out of jackrabbit-core and into the deployment packages in revision 1076635.
At the same time I went through the list of transitive dependencies and made the following exclusions:
<exclusions>
  <!-- Exclude the NetCDF and the related commons-httpclient -->
  <!-- libraries since the related NetCDF and HDF file       -->
  <!-- formats are not widely used beyond scientific data.   -->
  <exclusion>
    <groupId>edu.ucar</groupId>
    <artifactId>netcdf</artifactId>
  </exclusion>
  <exclusion>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
  </exclusion>
  <!-- Exclude the Apache MIME4J library as it's used for    -->
  <!-- parsing raw email messages and mbox files, which are  -->
  <!-- typically only needed by a file-based email system.   -->
  <exclusion>
    <groupId>org.apache.james</groupId>
    <artifactId>apache-mime4j</artifactId>
  </exclusion>
  <!-- Exclude the Commons Compress library as we don't want -->
  <!-- to parse compressed archives like zips by default.    -->
  <exclusion>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
  </exclusion>
  <!-- Exclude the ASM library as it's only used for parsing -->
  <!-- Java class files, for which there's typically no need -->
  <!-- in a content repository.                              -->
  <exclusion>
    <groupId>asm</groupId>
    <artifactId>asm</artifactId>
  </exclusion>
  <!-- Exclude the extractor library for EXIF and other      -->
  <!-- image metadata as we normally don't want to parse     -->
  <!-- images for full text indexing.                        -->
  <exclusion>
    <groupId>com.drewnoakes</groupId>
    <artifactId>metadata-extractor</artifactId>
  </exclusion>
  <!-- Exclude the Rome library as we normally don't want to -->
  <!-- parse RSS and Atom feeds for full text indexing.      -->
  <exclusion>
    <groupId>rome</groupId>
    <artifactId>rome</artifactId>
  </exclusion>
  <!-- Exclude the Boilerpipe library as we don't use the    -->
  <!-- BoilerpipeContentHandler functionality from Tika.     -->
  <exclusion>
    <groupId>de.l3s.boilerpipe</groupId>
    <artifactId>boilerpipe</artifactId>
  </exclusion>
</exclusions>
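For context, here is a minimal sketch of how such an exclusion list sits inside a Maven dependency declaration in a deployment package POM. The version shown is an assumption based on the issue's mention of Tika 0.9. The exact declaration committed in revision 1076635 is not reproduced here, and only the first exclusion is repeated.

```xml
<!-- Sketch only: the version is assumed from the issue text, not taken from the commit. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>0.9</version>
  <exclusions>
    <exclusion>
      <groupId>edu.ucar</groupId>
      <artifactId>netcdf</artifactId>
    </exclusion>
    <!-- The remaining exclusions from the list above go here. -->
  </exclusions>
</dependency>
```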
After these exclusions we'd still keep the following dependencies:
PDF: pdfbox, fontbox, jempbox, bcmail, bcprov
MS Office: poi, poi-ooxml, poi-ooxml-schemas, poi-scratchpad, xmlbeans
HTML: tagsoup
Basic formats like plain text and XML (plus rudimentary support for OpenOffice) are handled with the standard Java class library.
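A custom deployment that does need one of the excluded formats could, in principle, add the corresponding library back as a direct dependency of its own. The following is a sketch for compressed archives, not something taken from the commit. The version number is an assumption and should match whatever the tika-parsers POM in use declares.

```xml
<!-- Sketch only: re-adds a library excluded above so that Tika can parse archives again. -->
<!-- The version is an assumption; check the tika-parsers POM for the expected one. -->
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-compress</artifactId>
  <version>1.1</version>
</dependency>
```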
> Move tika-parsers dependency to deployment packages
> ---------------------------------------------------
>
> Key: JCR-2885
> URL: https://issues.apache.org/jira/browse/JCR-2885
> Project: Jackrabbit Content Repository
> Issue Type: Improvement
> Components: jackrabbit-core, jackrabbit-jca, jackrabbit-webapp
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 2.3.0
>
>
> As discussed on the mailing list, it would be better if the tika-parsers dependency (and all the parser libraries it pulls in transitively) were included in our deployment packages but not directly in jackrabbit-core. This would make it easier for people to set up custom lightweight deployments with no or only partial full text extraction functionality.
> To do this we'll first need to wait for Tika 0.9, as we currently have a custom PDFParser class in jackrabbit-core as a workaround to a problem in Tika 0.8.
> At the same time we should do a more thorough review of the transitive parser dependencies we include. At least the rome and bouncycastle libraries were flagged as potentially unnecessary.