Posted to solr-user@lucene.apache.org by "David T. Webb" <da...@brightmove.com> on 2011/11/12 16:07:48 UTC
TikaEntityProcessor Exception Handling
When indexing over 2 million documents with Solr and the
TikaEntityProcessor, the indexing fails if Tika encounters an exception
with one of the documents. How can I tell Solr to keep going and just
ignore the documents that fail in the TikaEntityProcessor?
Thanks.
--
Sincerely,
David Webb
Re: TikaEntityProcessor Exception Handling
Posted by akash2489 <ma...@gmail.com>.
Any updates on this?
Re: TikaEntityProcessor Exception Handling
Posted by Mark Miller <ma...@gmail.com>.
I'd file a JIRA issue.
On Nov 12, 2011, at 10:39 AM, David T. Webb wrote:
> Same result with onError="continue".
>
> Any help is appreciated. Thank you.
- Mark Miller
lucidimagination.com
RE: TikaEntityProcessor Exception Handling
Posted by "David T. Webb" <da...@brightmove.com>.
Same result with onError="continue".
Any help is appreciated. Thank you.
--
Sincerely,
David Webb
RE: TikaEntityProcessor Exception Handling
Posted by "David T. Webb" <da...@brightmove.com>.
I thought I had found the answer: setting onError="skip" on the entity.
However, after adding that attribute to data-config.xml, indexing still
stops when the TikaEntityProcessor throws an exception.
Nov 12, 2011 10:22:16 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 562
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@8a799a
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    ... 9 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 29
    at org.apache.poi.hwpf.model.StyleSheet.getCharacterStyle(StyleSheet.java:315)
    at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:60)
    at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
    at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:429)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:419)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:187)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 11 more
Nov 12, 2011 10:22:16 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Nov 12, 2011 10:22:16 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
--
Sincerely,
David Webb
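
For readers following the thread, here is a minimal sketch of where the onError attribute sits in a DataImportHandler data-config.xml. The poster's actual configuration was never shown, so the directory path, file name pattern, entity names, data source name, and field names below are all assumptions for illustration. The documented values for onError are "abort" (the default), "skip", and "continue". According to the reports in this thread, neither "skip" nor "continue" prevented the full import from aborting in the poster's environment, so this shows the configuration under discussion rather than a confirmed fix.

```xml
<dataConfig>
  <!-- Binary data source handed to Tika; the name "bin" is an assumption. -->
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <!-- Outer entity lists files on disk; baseDir and fileName are hypothetical. -->
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="/path/to/documents" fileName=".*\.(doc|docx|pdf)"
            recursive="true" onError="skip">
      <field column="fileAbsolutePath" name="id" />
      <!-- Inner entity runs Tika on each file. onError="skip" is meant to
           drop the current document when this entity raises an error. -->
      <entity name="tika" dataSource="bin"
              processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              onError="skip">
        <field column="text" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that onError is set per entity. In a nested layout like this one it may be worth setting it on both the file-listing entity and the Tika entity, since an error raised in the inner entity is reported through the outer one.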