Posted to solr-user@lucene.apache.org by RaghavPrabhu <ra...@gmail.com> on 2009/01/02 09:41:31 UTC

How can I omit illegal characters when indexing the docs?

Hi all,

  I am extracting text from a Word document using Apache POI and then
generating an XML document, which is what I want to index in Solr. The
problem I ran into is that it throws the error shown below in the browser.

HTTP Status 500 - Illegal character ((CTRL-CHAR, code 8)) at [row,col {unknown-source}]: [1,1592]

com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 8)) at [row,col {unknown-source}]: [1,1592]
    at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
    at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:660)
    at com.ctc.wstx.sr.BasicStreamReader.readCDataPrimary(BasicStreamReader.java:4240)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTreeCommentOrCData(BasicStreamReader.java:3280)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2824)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
    at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
    at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:179)
    at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446)
    at java.lang.Thread.run(Thread.java:619)

The extracted Word document contains a special character (it looks like a
square box). How can I omit such characters when I submit the document to
Solr?


Thanks in advance,
Regards
Prabhu.K


-- 
View this message in context: http://www.nabble.com/How-can-i-omit-the-illegal-characters%2Cwhen-indexing-the-docs--tp21249084p21249084.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How can I omit illegal characters when indexing the docs?

Posted by Peter Wolanin <pe...@acquia.com>.
For documents we are indexing via the PHP client, we are currently
using the following regex to strip control characters from each field
that might contain them:

function apachesolr_strip_ctl_chars($text) {
  // See:  http://w3.org/International/questions/qa-forms-utf-8.html
  // Printable utf-8 does not include any of these chars below x7F
  return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', ' ', $text);
}
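
For anyone doing the extraction in Java (as with Apache POI in the
original question), roughly the same filter can be sketched with
java.util.regex. This is only an illustration under assumptions: the
class name ControlCharStripper and the method strip() are made up for
this example and are not part of POI or Solr. It would be applied to each
field value before the text is written into the XML sent to Solr.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class ControlCharStripper {

    // Same character class as the PHP regex above: control characters
    // below 0x20, except tab (0x09), line feed (0x0A) and carriage
    // return (0x0D), which are legal in XML 1.0.
    private static final Pattern CONTROL_CHARS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    private ControlCharStripper() {
    }

    // Replaces each matching control character with a single space.
    // Returns null unchanged so callers can pass optional fields through.
    static String strip(String text) {
        if (text == null) {
            return null;
        }
        return CONTROL_CHARS.matcher(text).replaceAll(" ");
    }
}
```

A call such as ControlCharStripper.strip(extractedText) would replace the
code 8 (backspace) character reported in the error with a space, while
leaving tabs and line breaks alone.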

-Peter

-- 
--------------------------------------------------------------
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wolanin@acquia.com