Posted to solr-user@lucene.apache.org by Mark Cunningham <ma...@scrazzl.com> on 2011/06/15 14:09:40 UTC

char sets accepted via xml

Hi,

If you submit documents to Solr as XML, does the server assume the text is
Unicode encoded as UTF-8? And does it accept the whole range of Unicode
characters (for example, characters that require multiple bytes when
encoded in UTF-8)?

I'm getting quite a few "Invalid UTF-8 middle byte 0x20 (at char #408, byte
#-1)" errors (with different bytes/characters). They seem to come from
characters such as the trademark or registered symbol, or from characters
that look like ordinary ASCII characters (such as a dash). The dash comes
out as the UTF-8 code units E2 80 93 according to this very handy website:
http://rishida.net/tools/conversion/
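
To make the byte-level picture concrete, here is a minimal Python sketch. It
prints the UTF-8 bytes of the three characters mentioned above and then shows
one possible way (an assumption, not a confirmed diagnosis) that an "invalid
middle byte 0x20" could arise: text encoded as Latin-1 but decoded as UTF-8.
The helper names utf8_hex and is_valid_utf8 are made up for illustration.

```python
# Characters mentioned above: en dash, trademark sign, registered sign.
samples = {"en dash": "\u2013", "trademark": "\u2122", "registered": "\u00ae"}


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} -> {utf8_hex(ch)}")

# Hypothetical failure case: "café " encoded as Latin-1 is 63 61 66 E9 20.
# A UTF-8 decoder reads E9 as the lead byte of a three-byte sequence and
# then finds 0x20 (a space) where a continuation byte should be.
latin1_bytes = "caf\u00e9 ".encode("latin-1")
print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8("caf\u00e9 ".encode("utf-8")))
```

If the bytes that actually reach the server fail a check like is_valid_utf8,
the problem is likely on the sending side rather than in the XML itself.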

I tried inserting <?xml version="1.0" encoding="utf-8"?> at the start of the
XML, but this didn't seem to make much difference.
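
The XML declaration only labels the encoding; it does not change the bytes
that are sent. One thing that may be worth checking is that the request body
really is UTF-8 and that the HTTP Content-Type header says so. The sketch
below builds such a request in Python without sending it. The URL, the field
names, and the function name build_update_request are assumptions for
illustration, not taken from any particular setup.

```python
import urllib.request

# Assumption: a local Solr instance; adjust the URL to your own setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text: str, url: str = SOLR_UPDATE_URL) -> urllib.request.Request:
    """Build a POST request whose body is xml_text encoded as UTF-8."""
    body = xml_text.encode("utf-8")  # explicit encoding, no platform default
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical document containing a trademark sign and an en dash.
doc = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<add><doc><field name="id">1</field>'
    '<field name="title">Acme\u2122 \u2013 widgets</field></doc></add>'
)
req = build_update_request(doc)

# The request is only constructed here; urllib.request.urlopen(req) would send it.
print(req.get_method(), req.full_url)
print(req.get_header("Content-type"))
```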

Has anyone else had these issues, or does anyone know what might be causing
them?

Mark

Re: char sets accepted via xml

Posted by Tom Gross <it...@gmail.com>.
Hi,

I also have this issue with Solr 3.2.0. It is probably this:
https://issues.apache.org/jira/browse/SOLR-2381

Tom



-- 
Author of the book "Plone 3 Multimedia" - http://amzn.to/dtrp0C

Tom Gross
email..........tom@toms-projekte.de
skype.....................tom_gross
web.........http://toms-projekte.de
blog...http://blog.toms-projekte.de