Posted to solr-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2015/11/05 01:32:32 UTC

Re: tikaparser docx file fails with exception

Possibly a corrupt file? Tika does its best, but bad data is...bad data.

You can experiment a bit with running Tika directly from Java; that might
give you a better idea of what's really going on. Here's a SolrJ example:

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Best,
Erick

On Wed, Nov 4, 2015 at 3:49 PM, Aswath Srinivasan (TMS)
<as...@toyota.com> wrote:
>
> I am trying to index a docx file and end up with the exception below. I am not sure why it is failing. When I opened the docx I could see a lot of binary data, such as embedded pictures. Is there a possible solution to this, or is it a bug? Only this one file fails; the rest of the files are indexed without problems.
>
> 2015-11-04 23:16:11.549 INFO  (coreLoadExecutor-6-thread-1) [   x:tika] o.a.s.c.CoreContainer registering core: tika
> 2015-11-04 23:16:11.549 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore QuerySenderListener sending requests to Searcher@1eb69b2[tika] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
> 2015-11-04 23:16:11.585 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.S.Request [tika] webapp=null path=null params={q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false&event=firstSearcher} hits=0 status=0 QTime=34
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore QuerySenderListener done.
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: default
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: wordbreak
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.h.c.SuggestComponent buildOnStartup: mySuggester
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.s.s.SolrSuggester SolrSuggester.build(mySuggester)
> 2015-11-04 23:16:11.605 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore [tika] Registered new searcher Searcher@1eb69b2[tika] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
> 2015-11-04 23:16:25.923 INFO  (qtp7980742-16) [   x:tika] o.a.s.h.d.DataImporter Loading DIH Configuration: tika-data-config.xml
> 2015-11-04 23:16:25.937 INFO  (qtp7980742-16) [   x:tika] o.a.s.h.d.DataImporter Data Configuration loaded successfully
> 2015-11-04 23:16:25.947 INFO  (qtp7980742-16) [   x:tika] o.a.s.c.S.Request [tika] webapp=/solr path=/dataimport params={debug=false&optimize=false&indent=true&commit=true&clean=true&wt=json&command=full-import&verbose=false} status=0 QTime=28
> 2015-11-04 23:16:25.948 INFO  (Thread-17) [   x:tika] o.a.s.h.d.DataImporter Starting Full Import
> 2015-11-04 23:16:25.961 INFO  (Thread-17) [   x:tika] o.a.s.h.d.SimplePropertiesWriter Read dataimport.properties
> 2015-11-04 23:16:25.966 INFO  (qtp7980742-14) [   x:tika] o.a.s.c.S.Request [tika] webapp=/solr path=/dataimport params={indent=true&wt=json&command=status&_=1446678985952} status=0 QTime=1
> 2015-11-04 23:16:25.998 INFO  (Thread-17) [   x:tika] o.a.s.c.SolrCore [tika] REMOVING ALL DOCUMENTS FROM INDEX
> 2015-11-04 23:16:26.728 ERROR (Thread-17) [   x:tika] o.a.s.h.d.EntityProcessorWrapper Exception in entity : documentImport:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
>       at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:70)
>       at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:168)
>       at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:514)
>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
>       at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
>       at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
>       at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
>       at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
>       at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1b3e0a6
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:262)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:162)
>       ... 9 more
> Caused by: java.io.CharConversionException: Characters larger than 4 bytes are not supported: byte 0xb7 implies a length of more than 4 bytes
>       at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
>       at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
>       at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
>       at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3477)
>       at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:3962)
>       at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
>       at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
>       at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
>       at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
>       at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
>       at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
>       at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
>       at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:136)
>       at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:166)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:118)
>       at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
>       at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:181)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>       ... 12 more
>
> 2015-11-04 23:16:26.729 INFO  (Thread-17) [   x:tika] o.a.s.h.d.DocBuilder Import completed successfully
>

RE: tikaparser docx file fails with exception

Posted by "Allison, Timothy B." <ta...@mitre.org>.

I agree with everything below. Don't hesitate to open a ticket on Tika's Jira and/or POI's Bugzilla, especially if you can share the triggering document.

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafalov@gmail.com] 
Sent: Thursday, November 05, 2015 6:05 PM
To: solr-user <so...@lucene.apache.org>
Subject: Re: tikaparser docx file fails with exception

It is quite clear actually that the problem is this:
Caused by: java.io.CharConversionException: Characters larger than 4 bytes are not supported: byte 0xb7 implies a length of more than 4 bytes
      at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
      at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
      at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3477)

If you search for something like "PiccoloLexer.yy_refill Characters larger than 4 bytes are not supported", you get many matches in different forums for different (Java-based? Tika-based?) software. Most likely Tika found something obscure in the document that has no implementation yet, e.g. an image inside a text field inside a footer section. That is just an example.

I would try standalone Tika and look for its most verbose debug option. It should tell you which file inside the zip (which is what a docx actually is) caused the problem. That should give you a hint.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/



Re: tikaparser docx file fails with exception

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

It is quite clear actually that the problem is this:
Caused by: java.io.CharConversionException: Characters larger than 4 bytes are not supported: byte 0xb7 implies a length of more than 4 bytes
      at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
      at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
      at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3477)
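
One detail in that message is worth a second look: 0xb7 is 10110111 in binary, and in standard UTF-8 any byte of the form 10xxxxxx is a continuation byte, never the first byte of a character. So the decoder was apparently treating a mid-sequence byte as a lead byte. That suggests either bytes that are not valid UTF-8 at that position or a decoder that lost its place in the stream; the trace alone does not say which. The sketch below only illustrates the bit-pattern classification. The class name Utf8ByteRole is made up for this example and has nothing to do with the Piccolo source.

```java
/** Classifies one byte by the role its high bits give it in UTF-8. */
final class Utf8ByteRole {

    /**
     * Looks only at the bit pattern. It does not reject lead bytes that are
     * otherwise forbidden in UTF-8 (0xC0, 0xC1, 0xF5..0xF7).
     */
    static String describe(int b) {
        b &= 0xFF; // treat the argument as an unsigned byte
        if ((b & 0x80) == 0x00) return "single-byte (ASCII)";            // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return "continuation";                   // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return "lead byte of a 2-byte sequence"; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return "lead byte of a 3-byte sequence"; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return "lead byte of a 4-byte sequence"; // 11110xxx
        return "invalid in UTF-8";                                       // 11111xxx
    }

    public static void main(String[] args) {
        System.out.println("0xb7 -> " + describe(0xB7));
    }
}
```

Running it prints `0xb7 -> continuation`.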

If you search for something like "PiccoloLexer.yy_refill Characters
larger than 4 bytes are not supported", you get many matches in
different forums for different (Java-based? Tika-based?) software.
Most likely Tika found something obscure in the document that has no
implementation yet, e.g. an image inside a text field inside a footer
section. That is just an example.

I would try standalone Tika and look for its most verbose debug
option. It should tell you which file inside the zip (which is what a
docx actually is) caused the problem. That should give you a hint.
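
Since a docx is a zip archive, one rough way to narrow this down even without Tika is to walk the archive and strictly decode each XML part as UTF-8. This is only a sketch under assumptions: DocxPartScanner and findBadParts are hypothetical names, it uses the JDK's own decoder rather than the Piccolo decoder from the trace (so the two may disagree), and it would also flag a part that is legitimately encoded as UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: lists the XML parts of a docx that are not well-formed UTF-8. */
final class DocxPartScanner {

    static List<String> findBadParts(InputStream docx) throws IOException {
        List<String> bad = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                boolean xmlPart = name.endsWith(".xml") || name.endsWith(".rels");
                if (entry.isDirectory() || !xmlPart) {
                    continue; // skip images and other binary parts
                }
                byte[] bytes = readEntry(zip);
                try {
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    StandardCharsets.UTF_8.newDecoder()
                            .onMalformedInput(CodingErrorAction.REPORT)
                            .onUnmappableCharacter(CodingErrorAction.REPORT)
                            .decode(ByteBuffer.wrap(bytes));
                } catch (CharacterCodingException e) {
                    bad.add(name);
                }
            }
        }
        return bad;
    }

    /** Reads the current zip entry to its end; read() returns -1 at the end of the entry. */
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            List<String> bad = findBadParts(in);
            System.out.println(bad.isEmpty()
                    ? "All XML parts decode as UTF-8"
                    : "Suspect parts: " + bad);
        }
    }
}
```

Run it with the path to the failing docx as the only argument. Any part it lists is a candidate to inspect by hand or to attach to a Tika or POI ticket.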

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 5 November 2015 at 17:36, Aswath Srinivasan (TMS)
<as...@toyota.com> wrote:
> Thank you for attempting to answer. I will try SolrJ and standalone Java with the Tika parser. I completely understand that a bad document could cause this; however, when I opened the document I couldn't find anything suspicious except for some binary images/pictures embedded in it.

RE: tikaparser docx file fails with exception

Posted by "Aswath Srinivasan (TMS)" <as...@toyota.com>.

Thank you for attempting to answer. I will try SolrJ and standalone Java with the Tika parser. I completely understand that a bad document could cause this; however, when I opened the document I couldn't find anything suspicious except for some binary images/pictures embedded in it.


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Wednesday, November 04, 2015 4:33 PM
To: solr-user <so...@lucene.apache.org>
Subject: Re: tikaparser docx file fails with exception

Possibly a corrupt file? Tika does its best, but bad data is...bad data.

You can experiment a bit with running Tika directly from Java; that might give you a better idea of what's really going on. Here's a SolrJ example:

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Best,
Erick

On Wed, Nov 4, 2015 at 3:49 PM, Aswath Srinivasan (TMS) <as...@toyota.com> wrote:
>
> I am trying to index a docx file and end up with the exception below. I am not sure why it is failing. When I opened the docx I could see a lot of binary data, such as embedded pictures. Is there a possible solution to this, or is it a bug? Only this one file fails; the rest of the files are indexed without problems.
>
> 2015-11-04 23:16:11.549 INFO  (coreLoadExecutor-6-thread-1) [   x:tika] o.a.s.c.CoreContainer registering core: tika
> 2015-11-04 23:16:11.549 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore QuerySenderListener sending requests to Searcher@1eb69b2[tika] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
> 2015-11-04 23:16:11.585 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.S.Request [tika] webapp=null path=null params={q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false&event=firstSearcher} hits=0 status=0 QTime=34
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore QuerySenderListener done.
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: default
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: wordbreak
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.h.c.SuggestComponent buildOnStartup: mySuggester
> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.s.s.SolrSuggester SolrSuggester.build(mySuggester)
> 2015-11-04 23:16:11.605 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore [tika] Registered new searcher Searcher@1eb69b2[tika] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
> 2015-11-04 23:16:25.923 INFO  (qtp7980742-16) [   x:tika] o.a.s.h.d.DataImporter Loading DIH Configuration: tika-data-config.xml
> 2015-11-04 23:16:25.937 INFO  (qtp7980742-16) [   x:tika] o.a.s.h.d.DataImporter Data Configuration loaded successfully
> 2015-11-04 23:16:25.947 INFO  (qtp7980742-16) [   x:tika] o.a.s.c.S.Request [tika] webapp=/solr path=/dataimport params={debug=false&optimize=false&indent=true&commit=true&clean=true&wt=json&command=full-import&verbose=false} status=0 QTime=28
> 2015-11-04 23:16:25.948 INFO  (Thread-17) [   x:tika] o.a.s.h.d.DataImporter Starting Full Import
> 2015-11-04 23:16:25.961 INFO  (Thread-17) [   x:tika] o.a.s.h.d.SimplePropertiesWriter Read dataimport.properties
> 2015-11-04 23:16:25.966 INFO  (qtp7980742-14) [   x:tika] o.a.s.c.S.Request [tika] webapp=/solr path=/dataimport params={indent=true&wt=json&command=status&_=1446678985952} status=0 QTime=1
> 2015-11-04 23:16:25.998 INFO  (Thread-17) [   x:tika] o.a.s.c.SolrCore [tika] REMOVING ALL DOCUMENTS FROM INDEX
> 2015-11-04 23:16:26.728 ERROR (Thread-17) [   x:tika] o.a.s.h.d.EntityProcessorWrapper Exception in entity : documentImport:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
>       at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:70)
>       at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:168)
>       at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:514)
>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
>       at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
>       at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
>       at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
>       at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
>       at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1b3e0a6
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:262)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:162)
>       ... 9 more
> Caused by: java.io.CharConversionException: Characters larger than 4 bytes are not supported: byte 0xb7 implies a length of more than 4 bytes
>       at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
>       at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
>       at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
>       at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3477)
>       at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:3962)
>       at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
>       at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
>       at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
>       at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
>       at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
>       at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
>       at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
>       at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:136)
>       at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:166)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:118)
>       at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
>       at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:181)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>       ... 12 more
>
>
> 2015-11-04 23:16:26.729 INFO  (Thread-17) [   x:tika] o.a.s.h.d.DocBuilder Import completed successfully
>