Posted to dev@lucene.apache.org by "David Webb (Updated) (JIRA)" <ji...@apache.org> on 2011/11/13 20:15:51 UTC

[jira] [Updated] (SOLR-2896) TikaEntityProcessor onError not working in some cases

     [ https://issues.apache.org/jira/browse/SOLR-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Webb updated SOLR-2896:
-----------------------------

    Attachment: resume only true.doc

File that causes TikaException that onError does not handle.
                
> TikaEntityProcessor onError not working in some cases
> -----------------------------------------------------
>
>                 Key: SOLR-2896
>                 URL: https://issues.apache.org/jira/browse/SOLR-2896
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 3.4
>         Environment: Windows 7, JDK 1.6.0_18, Solr 3.4.0
>            Reporter: David Webb
>         Attachments: resume only true.doc
>
>
> When using the TikaEntityProcessor, I have a particular document (attached for testing) that causes a TikaException.  Even when the onError parameter of the TikaEntityProcessor is set to "skip" or "continue", the DIH still aborts and rolls back the entire indexing process.
> {code:title=data-config.xml snippet}
> <entity name="attach" onError="skip"
>         query="select filename, filedata from table where id = ${parentEntity.ID}">
>     <field column="filename" name="filename"/>
>     <entity dataSource="f2" processor="TikaEntityProcessor" url="filedata" dataField="attach.FILEDATA" format="text">
>         <field column="text" name="filedata"/>
>     </entity>
> </entity>
> {code}
> {code}
> Nov 12, 2011 10:22:16 AM org.apache.solr.common.SolrException log
> SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 562
>         at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>         at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
>         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
>         at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
>         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
>         at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
>         at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
>         at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@8a799a
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
>         at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
>         ... 9 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 29
>         at org.apache.poi.hwpf.model.StyleSheet.getCharacterStyle(StyleSheet.java:315)
>         at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:60)
>         at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
>         at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
>         at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
>         at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:429)
>         at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:419)
>         at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:187)
>         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>         ... 11 more
> Nov 12, 2011 10:22:16 AM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
> Nov 12, 2011 10:22:16 AM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: end_rollback
> {code}
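
For anyone reproducing this, the behavior the report expects from onError can be sketched as below. This is a minimal standalone illustration, not Solr's actual DataImportHandler code: the class OnErrorSketch, the Extractor interface, and the methods parsePolicy and importAll are all hypothetical names made up for the example. It assumes the commonly described semantics, where "abort" stops the import, "skip" drops the failing document, and "continue" carries on as if the error had not happened. Note that it catches Exception broadly, since the root cause in the trace above is a RuntimeException (ArrayIndexOutOfBoundsException) wrapped in a TikaException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class OnErrorSketch {
    /** The three onError values; ABORT is assumed to be the default. */
    enum Policy { ABORT, SKIP, CONTINUE }

    /** Hypothetical stand-in for a content extractor such as Tika. */
    interface Extractor {
        String extract(String rawDocument) throws Exception;
    }

    /** Maps the onError attribute text to a policy; a missing attribute means ABORT. */
    static Policy parsePolicy(String attr) {
        return attr == null ? Policy.ABORT
                            : Policy.valueOf(attr.trim().toUpperCase(Locale.ROOT));
    }

    /** Extracts text from each document, applying the policy when extraction fails. */
    static List<String> importAll(List<String> docs, Extractor extractor, Policy policy) {
        List<String> indexed = new ArrayList<>();
        for (String doc : docs) {
            try {
                indexed.add(extractor.extract(doc));
            } catch (Exception e) { // also covers RuntimeException from the parser
                switch (policy) {
                    case SKIP:
                        break; // drop this document, keep importing the rest
                    case CONTINUE:
                        indexed.add(""); // keep the document, without extracted text
                        break;
                    default:
                        // ABORT: fail the whole import, as in the log above
                        throw new IllegalStateException("Unable to read content: " + doc, e);
                }
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Hypothetical extractor that fails on one "corrupt" document.
        Extractor extractor = d -> {
            if (d.startsWith("bad")) throw new ArrayIndexOutOfBoundsException(29);
            return d.toUpperCase(Locale.ROOT);
        };
        List<String> docs = Arrays.asList("a.doc", "bad.doc", "c.doc");
        System.out.println(importAll(docs, extractor, parsePolicy("skip")));
        System.out.println(importAll(docs, extractor, parsePolicy("continue")));
    }
}
```

With "skip" the failing document is left out of the result, and with "continue" it is kept with empty text. The reported bug corresponds to the ABORT branch being taken regardless of the configured value.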
