Posted to solr-user@lucene.apache.org by knietzie <kn...@yahoo.com> on 2008/12/09 13:59:34 UTC

Re: problem index accented character with release version of solr 1.3

Hi Joshua,

I'm having the same problem.
Just curious, have you found a fix for this?

Thanks


Joshua Reedy wrote:
> 
> I have been using a stable dev version of 1.3 for a few months.
> Today, I began testing the final release version, and I encountered a
> strange problem.
> The only thing that has changed in my setup is the solr code (I didn't
> make any config change or change the schema).
> 
> A document has a text field with a value that contains:
> "Andr\005é 3000"
> 
> Indexing the document, by itself or as part of a batch, produces the
> following error:
> Sep 17, 2008 5:00:27 PM org.apache.solr.common.SolrException log
> SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 5))
>  at [row,col {unknown-source}]: [5,205]
>         at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
>         at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
>         at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
>         at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
>         at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
>         at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>         at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
>         at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
>         at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>         at java.lang.Thread.run(Thread.java:595)
> 
> The latest version of Solr doesn't seem to like control characters
> (\005, in this case), but previous versions handled them (or at least
> ignored them).
> 
> These characters shouldn't be in my documents, so there's a bug on my
> end to track down.  However, I'm wondering if this was an expected
> change or an unintended consequence of recent work . . .
> 
> 
> 
> 
> -- 
> -------------------------------------------------------------------------------------------------
> Be who you are and say what you feel,
> because those who mind don't matter and
> those who matter don't mind.
>  -- Dr. Seuss
> 
> 

-- 
View this message in context: http://www.nabble.com/problem-index-accented-character-with-release-version-of-solr-1.3-tp19544660p20914244.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: problem index accented character with release version of solr 1.3

Posted by Walter Underwood <wu...@netflix.com>.
Only a few control characters are legal in XML. Removing everything
but newlines, space, and tab is the right thing to do. --wunder
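
As a rough illustration of which characters XML 1.0 permits, the sketch below filters a string against the spec's Char production: tab, line feed, carriage return, and everything from U+0020 up, minus the surrogate block and U+FFFE/U+FFFF. It is in Java only because Solr is. The class and method names (`XmlCharFilter`, `isLegalXmlChar`, `stripIllegalXmlChars`) are made up for this example and are not part of Solr or Woodstox.

```java
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text without the code points XML 1.0 forbids. */
    static String stripIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Walk by code point so surrogate pairs are judged as one character.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Running the value from the original report through `XmlCharFilter.stripIllegalXmlChars` before building the update XML should drop the \005 and leave the rest of the text alone.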

On 12/9/08 5:45 AM, "Peter Wolanin" <pe...@acquia.com> wrote:

> We have been having this problem too, and have resorted to just
> stripping control characters before sending the text for indexing:
>
> preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', '', $text);
>
> -Peter



Re: problem index accented character with release version of solr 1.3

Posted by Peter Wolanin <pe...@acquia.com>.
We have been having this problem too, and have resorted to just
stripping control characters before sending the text for indexing:

preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', '', $text);
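
For anyone feeding Solr from Java rather than PHP, a minimal sketch of the same filter might look like the following. It reuses the character class from the preg_replace call above, so tab, line feed, and carriage return are kept. `ControlCharStripper` and `strip` are hypothetical names for illustration, not anything shipped with Solr or SolrJ.

```java
import java.util.regex.Pattern;

final class ControlCharStripper {
    // Same character class as the preg_replace call: C0 control characters
    // except tab (0x09), line feed (0x0A), and carriage return (0x0D).
    private static final Pattern CONTROL_CHARS =
        Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    private ControlCharStripper() {}

    /** Returns text with the matched control characters removed. */
    static String strip(String text) {
        return CONTROL_CHARS.matcher(text).replaceAll("");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every field when indexing in batches.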

-Peter

-- 
--------------------------------------------------------------
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wolanin@acquia.com