Posted to solr-user@lucene.apache.org by Matthew Parker <mp...@apogeeintegration.com> on 2012/02/27 01:33:59 UTC

TIKA Errors Importing MS Word Documents into SOLR Cloud

I tried to import some documents into SOLR Cloud using Apache ManifoldCF.

TIKA started throwing exceptions for various documents.

The exception reads like the following:

org.apache.solr.common.SolrException
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:213)
..........

Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d394424
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
...........
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.poi.hwpf.usermodel.Picture.fillRawImageContent(Picture.java:363)

It seems to be related to the following fix, which is now in Tika 1.1:

https://issues.apache.org/bugzilla/show_bug.cgi?id=51902

Can the Tika libraries in the SOLR trunk be updated?

------------------------------
This e-mail and any files transmitted with it may be proprietary.  Please note that any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of Apogee Integration.

Re: TIKA Errors Importing MS Word Documents into SOLR Cloud

Posted by Erick Erickson <er...@gmail.com>.
You *probably* can update the Tika libraries in Solr, but it'll be "interesting"
to get all the right ones updated, since there are a bunch of them in Tika. And I
make no guarantees.

If it proves difficult, it's not too hard to write a SolrJ program that does
the Tika extraction and run it on a client totally separated from the Solr
server.
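
As a rough illustration of that client-side approach, here is a minimal Java sketch of the control flow only. It is not actual SolrJ or Tika code: the TextExtractor and Indexer interfaces, the Report class, and the run method are hypothetical names made up for this example. In a real program the extractor would presumably wrap a Tika parser and the indexer would wrap a SolrJ client. The point it shows is that a parser failure on one document (like the ArrayIndexOutOfBoundsException above) is caught on the client and recorded, so the remaining documents still get indexed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ClientSideExtraction {

    /** Hypothetical hook: turns raw bytes into plain text (for example by calling Tika). */
    interface TextExtractor {
        String extract(String name, byte[] content) throws Exception;
    }

    /** Hypothetical hook: sends one extracted document to Solr (for example via SolrJ). */
    interface Indexer {
        void index(String id, String text) throws Exception;
    }

    /** Outcome of a run: ids that were indexed, and ids that failed with the reason. */
    static final class Report {
        final List<String> indexed = new ArrayList<>();
        final Map<String, String> failed = new LinkedHashMap<>();
    }

    /** Extracts and indexes each document; a failure only affects that one document. */
    static Report run(Map<String, byte[]> documents, TextExtractor extractor, Indexer indexer) {
        Report report = new Report();
        for (Map.Entry<String, byte[]> doc : documents.entrySet()) {
            try {
                String text = extractor.extract(doc.getKey(), doc.getValue());
                indexer.index(doc.getKey(), text);
                report.indexed.add(doc.getKey());
            } catch (Exception e) {
                // Runtime exceptions from a parser bug land here as well,
                // so the loop continues with the next document.
                report.failed.put(doc.getKey(), e.toString());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<String, byte[]> docs = new LinkedHashMap<>();
        docs.put("good.doc", "hello".getBytes(StandardCharsets.UTF_8));
        docs.put("bad.doc", new byte[0]);

        // Stand-in extractor: fails on empty input, simulating a parser that chokes on a file.
        TextExtractor fake = (name, bytes) -> {
            if (bytes.length == 0) {
                throw new ArrayIndexOutOfBoundsException("simulated parser failure");
            }
            return new String(bytes, StandardCharsets.UTF_8);
        };
        // Stand-in indexer: prints instead of talking to a Solr server.
        Indexer printing = (id, text) -> System.out.println("index " + id + ": " + text);

        Report report = run(docs, fake, printing);
        System.out.println("indexed=" + report.indexed + " failed=" + report.failed.keySet());
    }
}
```

Keeping the extraction and indexing steps behind small interfaces is just one way to lay this out; it lets the Tika version on the client be chosen independently of whatever ships with the Solr server.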

Best
Erick
