Posted to solr-user@lucene.apache.org by Teague James <te...@insystechinc.com> on 2014/10/01 21:15:36 UTC

Update with non UTF-8 characters

Hello!

I am indexing Solr 4.9.0 using the /update request handler and am getting
errors from Tika - Illegal IOException from
org.apache.tika.parser.xml.DcXMLParser@74ce3bea which is caused by
MalFormedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. I
believe that this is the result of attempting to pass information to Solr
via curl as XML in which the data has non-UTF-8 characters such as Smart
Quotes (the irony of that name is amazing). So when I run:

curl http://10.0.0.10/solr/pp/update?commit=true -H "Content-Type: text/xml"
--data-binary "<add><doc><field name=\"id\">123456</field><field
name=\"observation\">This is some text that was passed from the .NET
application to Solr for indexing. Users typically write in Word then copy
and paste into the .NET application UI which then passes everything to Solr
for indexing. If there are "smart quotes" it crashes, but "regular quotes"
are fine.</field></doc></add>"

I also tried /update/extract, but since this isn't an actual document it
still doesn't work. 

Is there a way to cope with these non UTF-8 characters using the /update
method I'm currently using by altering the content type or something? Maybe
altering the request handler? Or is it by virtue of text/xml that I cannot
use these characters and need to write logic into the application to strip
them out?

Any thoughts or advice would be appreciated! Thanks!

-Teague


Re: Update with non UTF-8 characters

Posted by Chris Hostetter <ho...@fucit.org>.
: I am indexing Solr 4.9.0 using the /update request handler and am getting
: errors from Tika - Illegal IOException from
: org.apache.tika.parser.xml.DcXMLParser@74ce3bea which is caused by
: MalFormedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. I

FWIW: that error appears to have come from /update/extract .. hard to be 
sure w/o full stack trace from the logs ... but i'll assume that's 
just a copy/paste mistake from the second test you mentioned trying, 
and assume your assessment is correct...

: believe that this is the result of attempting to pass information to Solr
: via CURL as XML in which the data has non UTF characters such as Smart
: Quotes (the irony of that name is amazing). So when I:

...and focus on the example command you mentioned...

: curl http://10.0.0.10/solr/pp/update?commit=true -H "Content-Type: text/xml"
: --data-binary "<add><doc><field name=\"id\">123456</field><field
: name=\"observation\">This is some text that was passed from the .NET
: application to Solr for indexing. Users typically write in Word then copy
: and paste into the .NET application UI which then passes everything to Solr
: for indexing. If there are "smart quotes" it crashes, but "regular quotes"
: are fine.</field></doc></add>"

if you tell solr you are sending it XML, then you have to send it valid 
XML.  if you don't specify a charset -- either in the Content-Type, or in 
an XML prolog declaration -- and there is no byte order mark, then the XML 
spec says UTF-8 must be assumed.  if the bytes in your doc aren't valid 
UTF-8, it's not a well-formed XML document, etc...

if you actually know what charset you are sending, then you can specify it 
-- and as long as your JVM implementation understands it, it should work.
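
As a rough illustration of the two points above (this is not from the 
original exchange): the sketch below uses Python's standard 
xml.etree.ElementTree as a stand-in for any XML parser. It assumes the 
pasted text reached the application as Windows-1252 bytes, which is only a 
guess about what Word and the .NET UI produce. The helper name `parses` 
is made up for the example.

```python
import xml.etree.ElementTree as ET

# Text with curly quotes (U+201C / U+201D), as pasted from a word processor.
text = "If there are \u201csmart quotes\u201d it crashes"
body = ('<add><doc><field name="id">123456</field>'
        '<field name="observation">' + text + '</field></doc></add>')

# ASSUMPTION: the bytes on the wire are Windows-1252, not UTF-8.
raw = body.encode("cp1252")


def parses(data: bytes) -> bool:
    """Return True if the bytes are well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# No charset declared: the parser assumes UTF-8 and hits an invalid byte.
undeclared_ok = parses(raw)

# Same bytes, but the prolog names the charset they are really in.
declared_ok = parses(b'<?xml version="1.0" encoding="windows-1252"?>' + raw)

# Alternative: transcode to UTF-8 before sending and declare nothing.
transcoded_ok = parses(raw.decode("cp1252").encode("utf-8"))

print(undeclared_ok, declared_ok, transcoded_ok)
```

The same choice applies to the curl request: either the bytes are 
converted to UTF-8 before they are sent, or the charset they are really in 
is named in the Content-Type header or the XML prolog.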

you can't however just read some raw bytes from somewhere, slap some 
xml-ish lookin strings in front & behind, and hope you have valid xml.

if you use a good XML serialization library in your .NET application to 
generate the messages you send to Solr, then the serialization library 
should help mitigate this problem -- either by specifying the correct 
encoding in the xml prolog it generates for you in its output, or by 
converting the input "strings" to utf-8, or by giving you a good error 
if/when you ask it to serialize characters that can't be serialized in XML 
(there are some, like null bytes and control characters).
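
A minimal sketch of the serialization-library approach, in Python rather 
than .NET purely for illustration: the document is built as a tree and the 
library handles escaping, encoding and the prolog. The function name 
`build_add_doc` and the sample values are made up; the field names come 
from the curl example above. The `xml_declaration` argument to 
`ET.tostring` needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def build_add_doc(doc_id: str, observation: str) -> bytes:
    """Serialize one Solr <add> message as UTF-8 bytes (hypothetical helper)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="observation").text = observation
    # The library escapes markup characters, encodes the text as UTF-8,
    # and writes a prolog that states the encoding.
    return ET.tostring(add, encoding="utf-8", xml_declaration=True)


sample = '\u201csmart quotes\u201d, "regular quotes" & <angle brackets>'
payload = build_add_doc("123456", sample)

# Round trip: parsing the payload gives back exactly the original text.
parsed = ET.fromstring(payload)
round_tripped = parsed.find("doc/field[@name='observation']").text
print(round_tripped == sample)
```

The resulting bytes would then be the request body, sent with a 
Content-Type that agrees with the prolog, for example 
"text/xml; charset=utf-8".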




-Hoss
http://www.lucidworks.com/