Posted to users@jackrabbit.apache.org by Marc Speck <ma...@gmail.com> on 2011/02/24 14:09:04 UTC
Search for Office 2007 in OSGi environment
We are testing an update from Jackrabbit 2.1.1 to 2.2.4 in an OSGi
environment similar to the sling.jcr.server bundle. I found the following
issues with search:
- JCR-2838 introduces a dependency from Tika on
org.apache.jackrabbit.core.query.pdf.PDFParser, which tika-bundle-0.8 cannot
import.
- tika-config.xml now configures org.apache.tika.parser.DefaultParser, but
defaultParser.getParsers().size() == 0. This seems to be a classloading issue
because new DefaultParser(DefaultParser.class.getClassLoader()) works fine
(see patch below).
- Falling back to SearchIndex.textFilterClasses does not work for Office
2007: JackrabbitParser maps the Office 2007 media types to OfficeParser even
though, according to log warnings, OfficeParser fails to read the format. I
guess OOXMLParser would work (see patch below).
- tika-bundle-0.9 does not embed xmlbeans-qname, which xmlbeans seems to
require.
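To illustrate the classloading workaround from the second item, here is a minimal sketch. It assumes the fix is to temporarily install the Tika bundle's class loader as the thread context class loader while the parser is created. ContextClassLoaderSwap and callWith are hypothetical names, and a stand-in ClassLoader takes the place of TikaConfig.class.getClassLoader() so that the example runs without Tika on the classpath.

```java
import java.util.concurrent.Callable;

public class ContextClassLoaderSwap {

    /**
     * Runs the task with the given loader installed as the thread context
     * class loader and restores the previous loader afterwards, even if
     * the task throws.
     */
    public static <T> T callWith(ClassLoader loader, Callable<T> task) throws Exception {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return task.call();
        } finally {
            thread.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) throws Exception {
        ClassLoader before = Thread.currentThread().getContextClassLoader();
        // Stand-in for the bundle's loader (an assumption for this demo);
        // in the patch this is TikaConfig.class.getClassLoader().
        ClassLoader bundleLoader = new ClassLoader(before) { };

        // The task sees the swapped-in loader while it runs.
        ClassLoader seen = callWith(bundleLoader,
                () -> Thread.currentThread().getContextClassLoader());

        System.out.println("task saw bundle loader: " + (seen == bundleLoader));
        System.out.println("loader restored: "
                + (Thread.currentThread().getContextClassLoader() == before));
    }
}
```

With Tika available, the task would presumably be the AutoDetectParser construction from the patch. The try/finally matters because other code on the same thread may rely on the original context class loader.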
What works fine so far is applying the patch below and using a patched
tika-bundle-0.9. Not sure whether you want to fix this in Jackrabbit or Tika
or whether I'm just messing up things... but maybe it helps someone :)
Marc
Index: src/main/java/org/apache/jackrabbit/core/query/lucene/JackrabbitParser.java
===================================================================
--- src/main/java/org/apache/jackrabbit/core/query/lucene/JackrabbitParser.java (revision 1071702)
+++ src/main/java/org/apache/jackrabbit/core/query/lucene/JackrabbitParser.java (working copy)
@@ -33,6 +33,7 @@
 import org.apache.tika.parser.html.HtmlParser;
 import org.apache.tika.parser.image.ImageParser;
 import org.apache.tika.parser.microsoft.OfficeParser;
+import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
 import org.apache.tika.parser.odf.OpenDocumentParser;
 import org.apache.tika.parser.pdf.PDFParser;
 import org.apache.tika.parser.rtf.RTFParser;
@@ -79,7 +80,13 @@
         try {
             if (stream != null) {
                 try {
-                    parser = new AutoDetectParser(new TikaConfig(stream));
+                    ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
+                    Thread.currentThread().setContextClassLoader(TikaConfig.class.getClassLoader());
+                    try {
+                        parser = new AutoDetectParser(new TikaConfig(stream));
+                    } finally {
+                        Thread.currentThread().setContextClassLoader(classLoader);
+                    }
                 } finally {
                     stream.close();
                 }
@@ -134,9 +141,10 @@
                 parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                 parsers.put(MediaType.application("mspowerpoint"), parser);
                 parsers.put(MediaType.application("vnd.ms-excel"), parser);
-                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), parser);
-                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), parser);
-                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), parser);
+                OOXMLParser ooxmlParser = new OOXMLParser();
+                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), ooxmlParser);
+                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), ooxmlParser);
+                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), ooxmlParser);
             } else if (name.equals("org.apache.jackrabbit.extractor.OpenOfficeTextExtractor")) {
                 Parser parser = new OpenDocumentParser();
                 parsers.put(MediaType.application("vnd.oasis.opendocument.database"), parser);
Index: src/main/resources/org/apache/jackrabbit/core/query/lucene/tika-config.xml
===================================================================
--- src/main/resources/org/apache/jackrabbit/core/query/lucene/tika-config.xml (revision 1071702)
+++ src/main/resources/org/apache/jackrabbit/core/query/lucene/tika-config.xml (working copy)
@@ -23,11 +23,6 @@
     <parser class="org.apache.tika.parser.DefaultParser"/>
-    <parser class="org.apache.jackrabbit.core.query.pdf.PDFParser">
-      <!-- JCR-2838: Override the faulty PDF parser in Tika 0.8 -->
-      <mime>application/pdf</mime>
-    </parser>
-
     <parser class="org.apache.tika.parser.EmptyParser">
       <!-- Disable package extraction as it's too resource-intensive -->
       <mime>application/x-archive</mime>
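
For readers skimming the JackrabbitParser hunk above, the routing it sets up can be pictured as a plain lookup table from media type to parser. The sketch below is a simplification, not Jackrabbit code: OfficeParserRouting and routes are hypothetical names, and parser class names are stored as strings instead of Parser instances so that it runs without Tika. Only the media types visible in the hunk are listed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OfficeParserRouting {

    static final String OFFICE_PARSER = "org.apache.tika.parser.microsoft.OfficeParser";
    static final String OOXML_PARSER = "org.apache.tika.parser.microsoft.ooxml.OOXMLParser";

    /** Media type to parser class name, mirroring the patched mapping. */
    public static Map<String, String> routes() {
        Map<String, String> routes = new LinkedHashMap<>();
        // Pre-2007 binary formats stay with OfficeParser.
        routes.put("application/vnd.ms-powerpoint", OFFICE_PARSER);
        routes.put("application/mspowerpoint", OFFICE_PARSER);
        routes.put("application/vnd.ms-excel", OFFICE_PARSER);
        // Office 2007 (OOXML) formats move to OOXMLParser.
        routes.put("application/vnd.openxmlformats-officedocument.wordprocessingml.document", OOXML_PARSER);
        routes.put("application/vnd.openxmlformats-officedocument.presentationml.presentation", OOXML_PARSER);
        routes.put("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", OOXML_PARSER);
        return routes;
    }

    public static void main(String[] args) {
        // Print the table so the split between the two parsers is visible.
        for (Map.Entry<String, String> route : routes().entrySet()) {
            System.out.println(route.getKey() + " -> " + route.getValue());
        }
    }
}
```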