Posted to solr-user@lucene.apache.org by Rupert Fiasco <ru...@gmail.com> on 2009/07/20 19:00:25 UTC

Indexing issue with XML control characters

During indexing I will often get this error:

SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal
character ((CTRL-CHAR, code 3))
 at [row,col {unknown-source}]: [2,1]
	at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)


From reading this list and elsewhere, I know that I need to filter out
most control characters, so I have been applying this regex:

/[\x00-\x08\x0B\x0C\x0E-\x1F]/
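
For reference, here is a minimal sketch of how a filter like this could be
applied to each field value before it goes into the XML update message. The
post does not say which language the indexing client is written in, so
Python and the function name strip_invalid_xml_chars are assumptions. The
pattern also adds the other code points that XML 1.0 disallows (surrogates,
U+FFFE, U+FFFF), which goes beyond the regex above.

```python
import re

# C0 control characters other than tab (0x09), LF (0x0A) and CR (0x0D),
# matching the regex in the post, plus surrogates, U+FFFE and U+FFFF,
# which XML 1.0 also rejects (an extension, not part of the original regex).
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that XML 1.0 does not allow removed."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Hypothetical field value containing a stray CTRL-C (code 3).
    print(repr(strip_invalid_xml_chars("leg\x03 bones")))
```

The filter only helps if it runs on every string that ends up in the
message, not only on the fields that look suspicious.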

But I still get the error. Strangely, if I re-run my indexing process
after a failure, the previously failed node indexes fine and a
different node errors out some time later. That is, the failure is not
deterministic. The text being indexed is about as clean as it gets (a
bunch of medical keywords like "leg bones" and "nose").
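
Since the source text looks clean, one way to narrow this down might be to
scan the exact bytes of the request body just before it is sent, rather
than the original text. The sketch below is only a diagnostic aid: the
function name find_control_bytes is made up, and it assumes the client can
get at the serialized payload as UTF-8 bytes.

```python
import re

# Same set of control codes as the regex in the post, expressed as bytes.
# In valid UTF-8 these byte values only ever occur as the control
# characters themselves, so a byte-level scan is safe.
_CONTROL_BYTES = re.compile(rb"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def find_control_bytes(payload: bytes) -> list[tuple[int, int]]:
    """Return (byte_offset, byte_value) for each control byte in payload."""
    return [(m.start(), m.group()[0]) for m in _CONTROL_BYTES.finditer(payload)]


if __name__ == "__main__":
    # Hypothetical payload with a stray code 3 at the start of the second line.
    body = b"<add>\n\x03<doc><field name=\"text\">nose</field></doc></add>"
    for offset, value in find_control_bytes(body):
        print(f"control byte {value} at offset {offset}")
```

Logging the result for each failed request would show whether the
character is in the field text or is introduced somewhere after the filter
runs.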

Any ideas would be greatly appreciated.

The platform is:

Solr implementation version: 1.3.0 694707
Lucene implementation version: 2.4-dev 691741
Mac OS X 10.5.7
JVM 1.5.0_19-b02-304


Thanks
/Rupert