Posted to dev@jackrabbit.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2011/03/03 16:26:36 UTC

[jira] Commented: (JCR-2885) Move tika-parsers dependency to deployment packages

    [ https://issues.apache.org/jira/browse/JCR-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002021#comment-13002021 ] 

Jukka Zitting commented on JCR-2885:
------------------------------------

I moved the tika-parsers dependency out of jackrabbit-core and into the deployment packages in revision 1076635.

At the same time I went through the list of transitive dependencies and made the following exclusions:

        <exclusions>
          <!-- Exclude the NetCDF and the related commons-httpclient -->
          <!-- libraries since the related NetCDF and HDF file       -->
          <!-- formats are not widely used beyond scientific data.   -->
          <exclusion>
            <groupId>edu.ucar</groupId>
            <artifactId>netcdf</artifactId>
          </exclusion>
          <exclusion>
            <groupId>commons-httpclient</groupId>
            <artifactId>commons-httpclient</artifactId>
          </exclusion>
          <!-- Exclude the Apache MIME4J library as it's used for    -->
          <!-- parsing raw email messages and mbox files, which are  -->
          <!-- typically only needed by a file-based email system.   -->
          <exclusion>
            <groupId>org.apache.james</groupId>
            <artifactId>apache-mime4j</artifactId>
          </exclusion>
          <!-- Exclude the Commons Compress library as we don't want -->
          <!-- to parse compressed archives like zips by default.    -->
          <exclusion>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-compress</artifactId>
          </exclusion>
          <!-- Exclude the ASM library as it's only used for parsing -->
          <!-- Java class files, for which there's typically no need -->
          <!-- in a content repository.                              -->
          <exclusion>
            <groupId>asm</groupId>
            <artifactId>asm</artifactId>
          </exclusion>
          <!-- Exclude the extractor library for EXIF and other      -->
          <!-- image metadata as we normally don't want to parse     -->
          <!-- images for full text indexing.                        -->
          <exclusion>
            <groupId>com.drewnoakes</groupId>
            <artifactId>metadata-extractor</artifactId>
          </exclusion>
          <!-- Exclude the Rome library as we normally don't want to -->
          <!-- parse RSS and Atom feeds for full text indexing.      -->
          <exclusion>
            <groupId>rome</groupId>
            <artifactId>rome</artifactId>
          </exclusion>
          <!-- Exclude the Boilerpipe library as we don't use the    -->
          <!-- BoilerpipeContentHandler functionality from Tika.     -->
          <exclusion>
            <groupId>de.l3s.boilerpipe</groupId>
            <artifactId>boilerpipe</artifactId>
          </exclusion>
        </exclusions>
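
For context, here is a minimal sketch of where that block would sit in a deployment package's pom.xml. The org.apache.tika:tika-parsers coordinates are the published ones, but the version shown is only an assumption based on the Tika 0.9 requirement mentioned in the issue description, and only one exclusion is repeated for brevity:

        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-parsers</artifactId>
          <!-- Assumed version; use whatever the build manages.      -->
          <version>0.9</version>
          <exclusions>
            <exclusion>
              <groupId>edu.ucar</groupId>
              <artifactId>netcdf</artifactId>
            </exclusion>
            <!-- ... remaining exclusions as listed above ...        -->
          </exclusions>
        </dependency>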

After these exclusions we'd still keep the following dependencies:

    PDF:         pdfbox, fontbox, jempbox, bcmail, bcprov
    MS Office:   poi, poi-ooxml, poi-ooxml-schemas, poi-scratchpad, xmlbeans
    HTML:        tagsoup

Basic formats like plain text and XML (plus rudimentary support for OpenOffice) are handled with the standard Java class library.
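
A deployment that does need one of the excluded formats should be able to opt back in by declaring the library as a direct dependency, since a Maven exclusion only affects what gets pulled in transitively through tika-parsers. A rough sketch for zip support, where the commons-compress.version property is a hypothetical placeholder and not something defined in the Jackrabbit build:

        <dependency>
          <groupId>org.apache.commons</groupId>
          <artifactId>commons-compress</artifactId>
          <!-- Hypothetical property; pick the version that matches  -->
          <!-- the Tika release in use.                              -->
          <version>${commons-compress.version}</version>
        </dependency>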

> Move tika-parsers dependency to deployment packages
> ---------------------------------------------------
>
>                 Key: JCR-2885
>                 URL: https://issues.apache.org/jira/browse/JCR-2885
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core, jackrabbit-jca, jackrabbit-webapp
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 2.3.0
>
>
> As discussed on the mailing list, it would be better if the tika-parsers dependency (and all the parser libraries it pulls in transitively) were included in our deployment packages but not directly in jackrabbit-core. This would make it easier for people to set up custom lightweight deployments with no or only partial full text extraction functionality.
> To do this we'll first need to wait for Tika 0.9, as we currently have a custom PDFParser class in jackrabbit-core as a workaround to a problem in Tika 0.8.
> At the same time we should do a more thorough review of the transitive parser dependencies we include. At least the rome and bouncycastle libraries were flagged as potentially unnecessary.
