
Posted to user@tika.apache.org by Tad Wimmer <tw...@spillman.com> on 2013/09/26 20:45:23 UTC

Extracting Metadata from MS Office (2007 +) Files on Glassfish

Hello.
I'm a Tika newbie, and running into an issue with Tika on Glassfish.
I'm using Tika to extract metadata from documents uploaded to a JSF 2.0 web application using PrimeFaces p:fileUpload (PrimeFaces 3.5) and running on Glassfish 3.2.2. Here are the essentials of my code:
private void extract(InputStream stream, Parser parser, Metadata metadata) {
    try {
        parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
    } catch (IOException | SAXException | TikaException e) {
        LOGGER.debug("Exception parsing file for metadata: {}", e);
    }
}
Passing in an AutoDetectParser with the InputStream from the fileupload works fine for MS Office OOXML (.docx, .xlsx, etc.) documents in JUnit tests, but fails when I run it in the Glassfish container with the following stack trace:
Caused by: java.lang.NoClassDefFoundError: org/dom4j/Namespace
at org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.<clinit>(PackagePropertiesUnmarshaller.java:49) ~[na:3.6]
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149) ~[na:3.6]
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136) ~[na:3.6]
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54) ~[na:3.6]
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:81) ~[na:3.6]
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220) ~[na:3.6]
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:86) ~[na:3.6]
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:53) ~[na:na]
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69) ~[na:na]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132) ~[na:na]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99) ~[na:na]
at com.spillman.fileupload.FileMetadataExtractor.extract(FileMetadataExtractor.java:220) ~[FileMetadataExtractor.class:na]
...
The tika-core-0.7, tika-parsers-0.7, tika-app-0.8, poi-3.6, poi-ooxml-3.6, and dom4j-1.6.1 jars are all in the build path and marked for export. I've even gone so far as to put the jars in the Glassfish endorsed directory. Web research hasn't produced anything directly related, but I did find a few references to this exception in CF related to class loader conflicts, which I wasn't able to extrapolate to our Glassfish implementation (that may be a lack of understanding on my part). What do I need to configure or change to get this to work on Glassfish?
Thanks in advance,

Tad

TAD B WIMMER | Spillman Technologies | JAVA DEVELOPER - HOSTED SOLUTIONS
Toll Free 800.860.8026 ext. 1747 | Phone 801.902.1747 | Fax 801.902.1210
4625 Lake Park Blvd., Salt Lake City, UT 84120
twimmer@spillman.com | www.spillman.com | www.citadex.com

RE: Extracting Metadata from MS Office (2007 +) Files on Glassfish

Posted by Tad Wimmer <tw...@spillman.com>.

Thanks, Nick.

I've upgraded to Tika 1.4 and added all of the remaining jars that the Maven repository site lists as its dependencies. That seems to have cleared the NoClassDefFoundError problem.
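
For anyone wanting to confirm that the deployed jars really are from one release, here is a minimal sketch that prints the version each library reports. It is only an illustration: the class name TikaVersionReport and the helper versionOf are made up for this example, and it assumes the jars record an Implementation-Version in their manifests, which not every jar does.

```java
public class TikaVersionReport {

    // Returns the Implementation-Version from the jar manifest of the given
    // class, "present, version unknown" if the manifest does not record one,
    // or "absent" if the class cannot be loaded at all.
    static String versionOf(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class<?> c = Class.forName(className, false,
                    TikaVersionReport.class.getClassLoader());
            Package p = c.getPackage();
            String v = (p == null) ? null : p.getImplementationVersion();
            return (v == null) ? "present, version unknown" : v;
        } catch (ClassNotFoundException | LinkageError e) {
            return "absent";
        }
    }

    public static void main(String[] args) {
        // One representative class per jar of interest (names taken from the
        // stack trace above, plus the Tika facade class from tika-core).
        String[] probes = {
            "org.apache.tika.Tika",
            "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
            "org.apache.poi.openxml4j.opc.OPCPackage",
            "org.dom4j.Namespace"
        };
        for (String name : probes) {
            System.out.println(name + " -> " + versionOf(name));
        }
    }
}
```

Calling versionOf from inside the web application (for example from a servlet or a managed bean) rather than from main reports what the container's class loader sees, which can differ from the IDE build path.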

-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: Thursday, September 26, 2013 12:54 PM
To: user@tika.apache.org
Subject: Re: Extracting Metadata from MS Office (2007 +) Files on Glassfish

On Thu, 26 Sep 2013, Tad Wimmer wrote:
> The tika-core-0.7, tika-parsers-0.7, tika-app-0.8,

Firstly, tika-0.7 is rather old, you should upgrade. Secondly, the tika-app jar is standalone and shouldn't be included in a webapp. Thirdly, all the jars need to be from the same version of tika, you can't mix and match!

> poi-3.6, poi-ooxml-3.6, and dom4j-1.6.1 jars are all in the build path 
> and marked for export

Tika has rather more dependencies than just those 3! I suspect you're missing some more

I'd suggest you move to the latest version of tika, and include all the 
dependencies

Nick

Re: Extracting Metadata from MS Office (2007 +) Files on Glassfish

Posted by Nick Burch <ap...@gagravarr.org>.

On Thu, 26 Sep 2013, Tad Wimmer wrote:
> The tika-core-0.7, tika-parsers-0.7, tika-app-0.8,

Firstly, tika-0.7 is rather old, so you should upgrade. Secondly, the 
tika-app jar is standalone and shouldn't be included in a webapp. Thirdly, 
all the jars need to be from the same version of Tika. You can't mix and 
match!

> poi-3.6, poi-ooxml-3.6, and dom4j-1.6.1 jars are all in the build path 
> and marked for export

Tika has rather more dependencies than just those 3! I suspect you're 
missing some more.

I'd suggest you move to the latest version of Tika, and include all the 
dependencies.

Nick
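
A quick way to see whether a dependency such as dom4j is actually visible at runtime is to ask the class loader directly. The following is a rough sketch, not anything from Tika itself: ClasspathCheck and locate are hypothetical names, and it assumes the thread context class loader is the one the container uses for the web application.

```java
import java.security.CodeSource;

public class ClasspathCheck {

    // Reports where a class would be loaded from, or why it cannot be loaded.
    static String locate(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathCheck.class.getClassLoader();
        }
        try {
            // initialize=false: resolve the class without running static initializers
            Class<?> c = Class.forName(className, false, loader);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader have no code source
            return (src == null) ? "bootstrap/JDK" : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "NOT FOUND (" + e + ")";
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in the original question
        String[] probes = {
            "org.dom4j.Namespace",
            "org.apache.poi.openxml4j.opc.OPCPackage",
            "org.apache.tika.parser.AutoDetectParser"
        };
        for (String name : probes) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Logging locate("org.dom4j.Namespace") from the code path that calls the parser shows either the jar the class comes from or that it is missing from the deployed application, which helps separate a packaging problem from a class loader conflict.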