Posted to solr-user@lucene.apache.org by rohit arora <ro...@yahoo.com> on 2008/12/23 14:50:56 UTC

Unicode characters that are not legal XML characters


Hi,

When I run the post command to build my index from my (database / XML) file, I get
an error like this:

com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 22))
 at [row,col {unknown-source}]: [1676,86]

I found a built-in function in Perl that converts all my character data to UTF-8,
but I find that many Unicode characters are not legal XML characters.

Can anyone help me find the list of all legal XML characters, so that
I can strip every character that is not on it?


with regards
 Rohit Arora



      

Re: Unicode characters that are not legal XML characters

Posted by Bryan Talbot <bt...@aeriagames.com>.
I believe you can use the following Unicode characters in XML 1.0
documents: U+0009, U+000A, U+000D, [U+0020-U+D7FF], [U+E000-U+FFFD],
and [U+10000-U+10FFFF].

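One of your documents contains character code 22, which the parser reports
as a control character. That is most likely U+0016 (decimal 22), which is not
legal in XML; U+0022 is the ordinary double quote and would be fine.

http://www.unicode.org/unicode/reports/tr20/#White

If your data is all text, you can probably safely remove the
disallowed control characters.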
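
As a rough sketch of that stripping step, here is one way it could look in
Python (not the Perl from the original question). The function name
strip_illegal_xml_chars and the optional replacement argument are made up for
illustration; the character ranges are the ones listed above.

```python
import re

# Matches any character OUTSIDE the ranges listed above:
# U+0009, U+000A, U+000D, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str, replacement: str = "") -> str:
    """Return text with every character outside the listed ranges replaced.

    Hypothetical helper for illustration: by default the offending
    characters are simply dropped.
    """
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    # U+0016 (decimal 22) is the kind of control character the parser rejected.
    dirty = "Solr\x16 index"
    print(repr(strip_illegal_xml_chars(dirty)))  # 'Solr index'
```

Run this on the field values before they are written into the XML file. It
only removes characters; characters such as & and < are legal but still need
to be escaped as usual.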


-Bryan



