Posted to dev@tika.apache.org by "Advokat (JIRA)" <ji...@apache.org> on 2017/12/15 11:56:00 UTC

[jira] [Commented] (TIKA-2529) ArrayIndexOutOfBoundsException when processing certain .doc files

    [ https://issues.apache.org/jira/browse/TIKA-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292408#comment-16292408 ] 

Advokat commented on TIKA-2529:
-------------------------------

I am not sure whether this is the best place to report this type of bug, since we are not using Tika directly but only through Solr. If I should report this issue in the Solr project instead, please let me know.

> ArrayIndexOutOfBoundsException when processing certain .doc files
> -----------------------------------------------------------------
>
>                 Key: TIKA-2529
>                 URL: https://issues.apache.org/jira/browse/TIKA-2529
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>         Environment: We are using Solr Version 7.1.0
>            Reporter: Advokat
>         Attachments: ProblemFormatIstAlt.doc
>
>
> When Solr (7.1.0) tries to parse this .doc file, we get the exception shown below.
> This seems to be related to an older form of the .doc format, because converting the .doc to a .docx and then back to a .doc fixes the issue.
> {
>   "responseHeader":{
>     "status":500,
>     "QTime":265},
>   "error":{
>     "metadata":[
>       "error-class","org.apache.solr.common.SolrException",
>       "root-error-class","java.lang.ArrayIndexOutOfBoundsException"],
>     "msg":"org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@20395b83",
>     "trace":"org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@20395b83\r\n\tat org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)\r\n\tat org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)\r\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)\r\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:2484)\r\n\tat org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:720)\r\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:526)\r\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)\r\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)\r\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)\r\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\r\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\r\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\r\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\r\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\r\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\r\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\r\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\r\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\r\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\r\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\r\n\tat 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\r\n\tat org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\r\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\r\n\tat org.eclipse.jetty.server.Server.handle(Server.java:534)\r\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\r\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\r\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)\r\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)\r\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\r\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\r\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\r\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\r\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\r\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\r\n\tat java.lang.Thread.run(Unknown Source)\r\nCaused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@20395b83\r\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)\r\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\r\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\r\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)\r\n\tat org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)\r\n\t... 
34 more\r\nCaused by: java.lang.ArrayIndexOutOfBoundsException: -1\r\n\tat org.apache.poi.hwpf.model.StyleSheet.getCharacterStyle(StyleSheet.java:329)\r\n\tat org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:74)\r\n\tat org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:100)\r\n\tat org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:727)\r\n\tat org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:227)\r\n\tat org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:712)\r\n\tat org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:702)\r\n\tat org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:174)\r\n\tat org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:175)\r\n\tat org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)\r\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\r\n\t... 38 more\r\n",
>     "code":500}}
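
For triage, the useful part of the trace above is the innermost "Caused by:" entry and the frame directly below it, which here point at org.apache.poi.hwpf.model.StyleSheet.getCharacterStyle rather than at Solr or Tika code. The following is only a minimal sketch of pulling that out, assuming the "trace" value has already been decoded from the JSON response into a plain string; the class and method names (RootCause, rootCause) are hypothetical and not part of Solr, Tika, or POI, and the sample string is an abridged copy of the trace in this issue.

```java
class RootCause {
    /**
     * Returns {cause, firstFrame} for the innermost "Caused by:" entry of a
     * Java stack trace. If the trace has no "Caused by:" line, the first two
     * lines of the trace are returned instead.
     */
    static String[] rootCause(String trace) {
        // Solr on Windows emits \r\n line endings; accept plain \n as well.
        String[] lines = trace.split("\\r?\\n");
        String marker = "Caused by: ";
        int last = -1;
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].startsWith(marker)) {
                last = i; // keep the last match, i.e. the innermost cause
            }
        }
        if (last < 0) {
            String second = lines.length > 1 ? lines[1].trim() : "";
            return new String[] { lines[0].trim(), second };
        }
        String cause = lines[last].substring(marker.length()).trim();
        String frame = last + 1 < lines.length ? lines[last + 1].trim() : "";
        return new String[] { cause, frame };
    }

    public static void main(String[] args) {
        // Abridged copy of the trace from this issue (assumption: already JSON-decoded).
        String trace =
            "org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException\r\n"
            + "\tat org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)\r\n"
            + "Caused by: org.apache.tika.exception.TikaException\r\n"
            + "\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)\r\n"
            + "Caused by: java.lang.ArrayIndexOutOfBoundsException: -1\r\n"
            + "\tat org.apache.poi.hwpf.model.StyleSheet.getCharacterStyle(StyleSheet.java:329)\r\n";
        String[] rc = rootCause(trace);
        System.out.println(rc[0]);
        System.out.println(rc[1]);
    }
}
```

The package of the first frame under the innermost cause is only a hint about where the failure originates; it does not by itself decide which project should track the bug.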


