Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2009/03/18 04:19:35 UTC

Re: com.ctc.wstx.exc.WstxLazyException exception while passing the text content of a word doc to SOLR

: I am using Apache POI parser to parse a Word Doc and extract the text
: content. Then i am passing the text content to SOLR. The Word document has
: many pictures, graphs and tables. But when i am passing the content to SOLR,
: it fails. Here is the exception trace.
: 
: 09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM
: org.apache.solr.common.SolrException log
: SEVERE: [com.ctc.wstx.exc.WstxLazyException]
: com.ctc.wstx.exc.WstxParsingException: Illegal character entity:
: expansion character (code 0x7) not a valid XML character
:  at [row,col {unknown-source}]: [40,18]

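the error string is fairly self-explanatory: at row 40, column 18 of the 
XML you sent, there is a character that isn't legal in XML (0x7)

(not every character you can encode in UTF-8 is legal in XML: XML 1.0 
forbids all control characters below 0x20 except tab, LF and CR)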

If you search the solr-user archives for "Illegal character" you'll find 
lots of discussion about this and how to deal with it in general.
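
One common way to deal with it is to strip the illegal characters from the 
extracted text on the client, before it goes into the XML you send to Solr. 
A minimal sketch, assuming you have the POI output as a Java String: the 
class and method names (XmlCharFilter, stripInvalidXmlChars) are made up for 
this example and are not part of POI or Solr. The ranges are the ones the 
XML 1.0 spec allows in its Char production. My guess is the 0x7 comes from 
the tables in the Word doc, but I have not verified that.

```java
/**
 * Hypothetical helper (not part of POI or Solr): removes characters that
 * XML 1.0 does not allow, so the text can be put safely into an update message.
 */
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text with every illegal XML character dropped. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Walk by code point so surrogate pairs are kept or dropped as a unit.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

You would call XmlCharFilter.stripInvalidXmlChars(text) on the extracted 
text right before building the document you send to Solr. Dropping the 
characters is the simplest choice; replacing them with a space instead may 
be better if they separate words (table cells, for instance).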

You might also want to check out this comment pointing out some advantages 
in using Tika instead of using POI directly...

https://issues.apache.org/jira/browse/LUCENE-1559?#action_12681347

..lastly you might want to check out this plugin and do all the hard work 
server side...

http://wiki.apache.org/solr/ExtractingRequestHandler




-Hoss