Posted to solr-user@lucene.apache.org by rohit arora <ro...@yahoo.com> on 2008/12/23 14:50:56 UTC
Unicode characters that are not legal XML characters
Hi,
When I run the post command to build my index from my (database / XML) file, I get
the following error:
com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 22))
at [row,col {unknown-source}]: [1676,86]
I found a built-in function in Perl to convert all my character data to UTF-8,
but I find that there are many Unicode characters that are not legal XML characters.
Can anyone help me find the list of all the legal XML characters, so that
I can strip every character except those?
with regards
Rohit Arora
Re: Unicode characters that are not legal XML characters
Posted by Bryan Talbot <bt...@aeriagames.com>.
I believe you can use the following unicode characters in XML
documents: U+0009, U+000A, U+000D, [U+0020-U+D7FF], [U+E000-U+FFFD],
and [U+10000-U+10FFFF]
One of your documents contains a U+0016 character (the parser reports
the code in decimal, 22), which is a control character that is not
legal in XML.
http://www.unicode.org/unicode/reports/tr20/#White
If your data is all text, you can probably safely remove the
disallowed control characters.
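As a rough illustration of stripping everything outside the ranges listed above, here is a minimal Python sketch (the original poster mentions Perl; Python is used here only for illustration). The function names `strip_illegal_xml_chars` and `find_illegal_xml_chars` are hypothetical, and the row/column numbering in the second helper is an assumption that may not match how the parser counts positions.

```python
import re

# Matches any character NOT in the legal set quoted above:
# U+0009, U+000A, U+000D, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text, replacement=""):
    """Return text with every illegal XML character replaced (removed by default)."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


def find_illegal_xml_chars(text):
    """Yield (row, col, code point) for each illegal character.

    Rows and columns are 1-based and rows are split on "\n" only; this is
    an assumption and may differ from the positions a parser reports.
    """
    for row, line in enumerate(text.split("\n"), start=1):
        for match in _ILLEGAL_XML_CHARS.finditer(line):
            yield row, match.start() + 1, ord(match.group())
```

For example, `strip_illegal_xml_chars("abc\x16def")` returns `"abcdef"`, while tabs, newlines, and carriage returns are left alone. Running `find_illegal_xml_chars` over the source data first can help confirm which characters are present before anything is removed.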
-Bryan
On Dec 23, 2008, at 5:50 AM, rohit arora wrote:
>
>
> Hi,
>
> When i give post command to build my Index on my (databases / XML)
> file it gives me
> an error which is like .
>
> com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character
> ((CTRL-CHAR, code 22))
> at [row,col {unknown-source}]: [1676,86]
>
> I find a inbuild function in perl to convert all my character data
> in "UTF-8" format
> I find that there are many Unicode Character that are not legal XML
> Character.
>
> Can any one help me to find the list of all the legal XML Character
> so that
> I can strip all character except those characters.
>
>
> with regards
> Rohit Arora