Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2009/03/18 04:19:35 UTC
Re: com.ctc.wstx.exc.WstxLazyException exception while passing the
text content of a word doc to SOLR
: I am using Apache POI parser to parse a Word Doc and extract the text
: content. Then i am passing the text content to SOLR. The Word document has
: many pictures, graphs and tables. But when i am passing the content to SOLR,
: it fails. Here is the exception trace.
:
: 09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM
: org.apache.solr.common.SolrException log
: SEVERE: [com.ctc.wstx.exc.WstxLazyException]
: com.ctc.wstx.exc.WstxParsingException: Illegal character entity:
: expansion character (code 0x7) not a valid XML character
: at [row,col {unknown-source}]: [40,18]
The error string is fairly self-explanatory: on line 40, column 18 you
have a character that isn't legal in XML (0x7, the BEL control character).
(Not all UTF-8 characters are legal in XML.)
If you search the Solr archives for "Illegal character" you'll find lots of
discussion about this and how to deal with it in general.
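One common way to deal with it is to strip the offending characters from the
extracted text on the client before it is sent to Solr. The following is only
a rough sketch of that idea, not something taken from the original thread:
the class name XmlCharFilter and both method names are made up for
illustration, and the ranges are the ones the XML 1.0 spec allows for
character data.

```java
// Hypothetical helper (name and methods are illustrative assumptions):
// drops code points that XML 1.0 does not allow in character data.
final class XmlCharFilter {
    private XmlCharFilter() {}

    // XML 1.0 "Char" production: tab, LF, CR, and the three ranges below.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text with every illegal code point removed.
    // Unpaired surrogates fall in 0xD800-0xDFFF and are dropped as well.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

You would call XmlCharFilter.stripInvalidXmlChars(...) on the string POI
gives you before adding it to the document. Whether silently dropping the
characters or replacing them with a space is better depends on your data.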
You might also want to check out this comment pointing out some advantages
in using Tika instead of using POI directly...
https://issues.apache.org/jira/browse/LUCENE-1559?#action_12681347
...lastly, you might want to check out this plugin and do all the hard work
server side...
http://wiki.apache.org/solr/ExtractingRequestHandler
-Hoss