Posted to solr-user@lucene.apache.org by Suryasnat Das <su...@gmail.com> on 2009/03/14 05:22:39 UTC

com.ctc.wstx.exc.WstxLazyException exception while passing the text content of a word doc to SOLR

Hi,

I am using the Apache POI parser to parse a Word document and extract its
text content, which I then pass to Solr. The Word document has many
pictures, graphs and tables. When I pass the content to Solr, indexing
fails. Here is the exception trace.

09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM org.apache.solr.common.SolrException log
SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x7) not a valid XML character
 at [row,col {unknown-source}]: [40,18]
        at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
        at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
        at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
        at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
        at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
        at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
        at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
        at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
        at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:601)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:595)

A second error, this one from POI, is also being thrown:

09:31:04,828 ERROR [STDERR] java.io.IOException: Unable to read entire header; 130 bytes read; expected 512 bytes
09:31:04,828 ERROR [STDERR]     at org.apache.poi.poifs.storage.HeaderBlockReader.alertShortRead(HeaderBlockReader.java:130)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:94)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:151)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:133)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:51)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.parseWordFile(SearchApplicationServlet.java:963)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.indexDirectory(SearchApplicationServlet.java:813)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.index(SearchApplicationServlet.java:747)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.processAdd(SearchApplicationServlet.java:331)
09:31:04,874 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.doGet(SearchApplicationServlet.java:160)
09:31:04,874 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.doPost(SearchApplicationServlet.java:306)
09:31:04,874 ERROR [STDERR]     at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
09:31:04,874 ERROR [STDERR]     at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
09:31:04,874 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
09:31:04,890 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
09:31:04,890 ERROR [STDERR]     at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
09:31:04,890 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
09:31:04,890 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
09:31:04,906 ERROR [STDERR]     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
09:31:04,906 ERROR [STDERR]     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
09:31:04,906 ERROR [STDERR]     at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
09:31:04,906 ERROR [STDERR]     at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
09:31:04,906 ERROR [STDERR]     at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
09:31:04,921 ERROR [STDERR]     at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
09:31:04,921 ERROR [STDERR]     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
09:31:04,921 ERROR [STDERR]     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
09:31:04,968 ERROR [STDERR]     at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
09:31:04,968 ERROR [STDERR]     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
09:31:04,968 ERROR [STDERR]     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
09:31:04,968 ERROR [STDERR]     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
09:31:04,968 ERROR [STDERR]     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:601)
09:31:04,984 ERROR [STDERR]     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
09:31:04,984 ERROR [STDERR]     at java.lang.Thread.run(Thread.java:595)

The source code is below.

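import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Extracts the plain text of a Word (.doc) file; returns null on failure.
private static String parseWordFile(File f) {
    String text = null;
    // try-with-resources (Java 7+) closes the stream even if extraction fails
    try (InputStream in = new FileInputStream(f)) {
        WordExtractor we = new WordExtractor(in);
        text = we.getText();
    } catch (Exception ex) {
        System.out.println("exception occurred for ::" + f.getName());
        ex.printStackTrace();
    }
    return text;
}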
WordExtractor belongs to the package org.apache.poi.hwpf.extractor.

I would highly appreciate quick help in resolving this.

Regards
Suryasnat Das

Re: com.ctc.wstx.exc.WstxLazyException exception while passing the text content of a word doc to SOLR

Posted by Chris Hostetter <ho...@fucit.org>.
: I am using Apache POI parser to parse a Word Doc and extract the text
: content. Then i am passing the text content to SOLR. The Word document has
: many pictures, graphs and tables. But when i am passing the content to SOLR,
: it fails. Here is the exception trace.
: 
: 09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM org.apache.solr.common.SolrException log
: SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x7) not a valid XML character
:  at [row,col {unknown-source}]: [40,18]

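The error string is fairly self-explanatory: at row 40, column 18 of the
XML you sent, there is a character (0x7) that isn't legal in XML.

(Not all Unicode characters are legal in XML. In XML 1.0, control
characters below 0x20 other than tab, line feed and carriage return are
not allowed, even when written as a character entity.)

If you search the Solr archives for "Illegal character" you'll find lots of
discussion about this and how to deal with it in general.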
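
One common way to deal with it is to strip the offending characters from the
extracted text before building the XML update message. The following is only a
minimal sketch of that idea, not code from Solr or POI: the class name
XmlSanitizer and its methods are hypothetical, and the allowed ranges follow
the Char production of the XML 1.0 specification.

```java
// Hypothetical helper (not part of Solr or POI): removes code points that
// XML 1.0 does not allow, so the text can be placed in an update message.
public final class XmlSanitizer {
    private XmlSanitizer() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point that XML 1.0 forbids removed. */
    public static String stripInvalidXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Iterate by code point so surrogate pairs are kept together;
            // an unpaired surrogate fails the range check and is dropped.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample input: text containing the 0x7 control character.
        String extracted = "cell one\u0007cell two";
        System.out.println(stripInvalidXmlChars(extracted));
    }
}
```

In the code from the original post, this would be applied to the result of
we.getText() before the text is sent to Solr. If the removed characters act as
separators in your documents, replacing them with a space instead of deleting
them may give better tokenization.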

You might also want to check out this comment pointing out some advantages 
in using Tika instead of using POI directly...

https://issues.apache.org/jira/browse/LUCENE-1559?#action_12681347

Lastly, you might want to check out this plugin and do all the hard work
server side...

http://wiki.apache.org/solr/ExtractingRequestHandler




-Hoss