Posted to solr-user@lucene.apache.org by Morten Fangel <fa...@sevengoslings.net> on 2007/03/10 21:01:54 UTC

Adding data as UTF-8

Hi,

I've been working on adding Solr integration to my current project, but 
have run into a problem with non-ASCII characters.

I send a document like the following:

---
<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
  <field name="question_id">228</field>
  <field name="question_title">Vedhæft billede til min formular</field>
  <field name="userid">26</field>
  <field name="question_text">Jeg har lavet en side som skal info om 
værkstedet Badsetuen i Odense, som er under kraftig omlægning af kommunen - 
dvs nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om 
deres håndværk udført på stedet. 
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/

Nogle ideer ?</field>
  <field name="question_date">2006-05-17T08:44:23Z</field>
  <field name="question_tags">Upload</field>
  <field name="question_tags">HTML</field>
  <field name="question_tags">Email</field>
  <field name="question_tags">Vedhæftning</field>
</doc></add>
---

But when I run a search like "/solr/select/?q=billede" (the default search 
field is "text", a multiValued copyField from question_title and 
question_text), I get the document back as:

---
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 ...
</lst>
<result name="response" numFound="1" start="0">
 <doc>
  <date name="question_date">2006-05-17T08:44:23Z</date>
  <int name="question_id">228</int>
  <arr name="question_tags"><str>Upload</str><str>HTML</str><str>Email</str>
	<str>Vedhæftning</str></arr>
  <str name="question_text">Jeg har lavet en side som skal info om værkstedet 
Badsetuen i Odense, som er under kraftig omlægning af kommunen - dvs 
nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om 
deres håndværk udført på stedet. 
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/

Nogle ideer ?</str>
  <str name="question_title">Vedhæft billede til min formular</str>
  <str name="userid">26</str>
 </doc>
</result>
</response>
---

Which is basically the same text, but displayed as ISO-8859-1. How can this be? 
Do I have to send off some header saying it is UTF-8, or should I just send 
the data as UTF-8 (that produces the correct encoding in answers, but sounds 
like a silly way of doing it)?
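
The symptom is consistent with UTF-8 bytes being decoded as ISO-8859-1 somewhere between the client and the index. A minimal Python sketch reproduces the effect with the first word of the question title; it is an illustration only, not something taken from Solr itself:

```python
# Sketch: what happens when UTF-8 bytes are decoded with the wrong charset.
title = "Vedhæft"

utf8_bytes = title.encode("utf-8")         # 'æ' becomes the two bytes 0xC3 0xA6
misread = utf8_bytes.decode("iso-8859-1")  # each byte is read as one character

print(misread)  # two characters now appear where 'æ' used to be

# Reversing the mistake recovers the original text.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == title)
```

If the garbled text in a response can be repaired this way, the bytes were most likely correct on the wire and only the charset label was wrong or missing.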

Any ideas?

Btw, the install script listed at http://wiki.apache.org/solr/SolrTomcat is a 
bit wrong. Should I just contribute the fixes (new solr dir and name to 
fetch) to the wiki, or would any of you rather do it yourselves?

Regards
 -fangel

Re: Adding data as UTF-8

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 3/10/07, Walter Underwood <wu...@netflix.com> wrote:
> If it does something different, that is a bug. RFC 3023 is clear. --wunder..

Sure - just wanted to confirm what I'm seeing, thanks!

-Bertrand

Re: Adding data as UTF-8

Posted by Walter Underwood <wu...@netflix.com>.
If it does something different, that is a bug. RFC 3023 is clear. --wunder

On 3/10/07 1:49 PM, "Bertrand Delacretaz" <bd...@apache.org> wrote:

> On 3/10/07, Walter Underwood <wu...@netflix.com> wrote:
>> It is better to use "application/xml". See RFC 3023.
>> Using "text/xml; charset=UTF-8" will override the XML
>> encoding declaration. "application/xml" will not...
> 
> I agree, but did you try this with our example setup, started with
> "java -jar start.jar"?
> 
> It doesn't seem to work here: If I change our example/exampledocs/post.sh to
> use
> 
>    curl $URL --data-binary @$f -H 'Content-type:application/xml'
> 
> instead of
> 
>   curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
> 
> the encoding declaration of my posted XML is ignored, characters are
> interpreted according to my JVM encoding (-Dfile.encoding makes a
> difference in that case).
> 
> Are you seeing something different, or do you know why this is so?
> 
> -Bertrand


Re: Adding data as UTF-8

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 3/10/07, Walter Underwood <wu...@netflix.com> wrote:
> It is better to use "application/xml". See RFC 3023.
> Using "text/xml; charset=UTF-8" will override the XML
> encoding declaration. "application/xml" will not...

I agree, but did you try this with our example setup, started with
"java -jar start.jar"?

It doesn't seem to work here: If I change our example/exampledocs/post.sh to use

   curl $URL --data-binary @$f -H 'Content-type:application/xml'

instead of

  curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'

the encoding declaration of my posted XML is ignored, characters are
interpreted according to my JVM encoding (-Dfile.encoding makes a
difference in that case).

Are you seeing something different, or do you know why this is so?

-Bertrand

Re: Adding data as UTF-8

Posted by Morten Fangel <fa...@sevengoslings.net>.
On Saturday 10 March 2007 21:39, Bertrand Delacretaz wrote:
> On 3/10/07, Morten Fangel <fa...@sevengoslings.net> wrote:
> > ...I send a document like the following:
> >
> > ---
> > <?xml version="1.0" encoding="UTF-8"?>...
>
> I assume you're using your own code to "send" the document?
Indeed. Solr will be integrated (almost) transparently into my framework.. ;)

It'll work pretty much like the acts_as_solr RoR implementation, if I'm not 
totally mistaken about that particular implementation.. 
>
> Currently you need to include a "Content-type: text/xml;
> charset=UTF-8" header in your HTTP POST request, and (as you're doing)
> the XML needs to be encoded in UTF-8.
Super. Indeed that fixed it, yes...

-fangel


Re: Adding data as UTF-8

Posted by Morten Fangel <fa...@sevengoslings.net>.
On Saturday 10 March 2007 22:18, Walter Underwood wrote:
> It is better to use "application/xml". See RFC 3023.
> Using "text/xml; charset=UTF-8" will override the XML
> encoding declaration. "application/xml" will not.
Thanks for the info. I've changed the header accordingly.

-fangel

Re: Adding data as UTF-8

Posted by Walter Underwood <wu...@netflix.com>.
It is better to use "application/xml". See RFC 3023.
Using "text/xml; charset=UTF-8" will override the XML
encoding declaration. "application/xml" will not.

wunder
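
The RFC 3023 precedence described above can be sketched as a small Python function. This is a simplified reading of the RFC's rules (BOM detection is left out), and `effective_encoding` is a hypothetical helper name; it does not describe what Solr or any particular servlet container actually does.

```python
import re


def effective_encoding(content_type: str, body: bytes) -> str:
    """Pick the charset an RFC 3023 style receiver would use (simplified)."""
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # 1. An explicit charset parameter wins for both media types, so it
    #    overrides whatever the XML encoding declaration says.
    match = re.search(r'charset\s*=\s*"?([^";\s]+)"?', params, re.IGNORECASE)
    if match:
        return match.group(1).lower()

    # 2. text/xml without a charset parameter defaults to US-ASCII,
    #    regardless of the XML declaration.
    if media_type == "text/xml":
        return "us-ascii"

    # 3. application/xml without a charset parameter: use the XML encoding
    #    declaration, falling back to UTF-8 (BOM handling omitted here).
    decl = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', body)
    return decl.group(1).decode("ascii").lower() if decl else "utf-8"


latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><add/>'
print(effective_encoding("text/xml; charset=UTF-8", latin1_doc))
print(effective_encoding("application/xml", latin1_doc))
```

The two calls show the difference in question: with "text/xml; charset=UTF-8" the header decides, while with a bare "application/xml" the document's own declaration decides.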

On 3/10/07 12:39 PM, "Bertrand Delacretaz" <bd...@apache.org> wrote:

> On 3/10/07, Morten Fangel <fa...@sevengoslings.net> wrote:
> 
>> ...I send a document like the following:
>> 
>> ---
>> <?xml version="1.0" encoding="UTF-8"?>...
> 
> I assume you're using your own code to "send" the document?
> 
> Currently you need to include a "Content-type: text/xml;
> charset=UTF-8" header in your HTTP POST request, and (as you're doing)
> the XML needs to be encoded in UTF-8.
> 
> See the source code of
> src/java/org/apache/solr/util/SimplePostTool.java for example.
> 
> -Bertrand


Re: Adding data as UTF-8

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 3/10/07, Morten Fangel <fa...@sevengoslings.net> wrote:

> ...I send a document like the following:
>
> ---
> <?xml version="1.0" encoding="UTF-8"?>...

I assume you're using your own code to "send" the document?

Currently you need to include a "Content-type: text/xml;
charset=UTF-8" header in your HTTP POST request, and (as you're doing)
the XML needs to be encoded in UTF-8.

See the source code of
src/java/org/apache/solr/util/SimplePostTool.java for example.

-Bertrand
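
For a client that is not written in Java, the same idea (UTF-8 encoded body plus a matching Content-type header) might look like the following Python sketch. The URL is an assumption based on a default local install, and `build_update_request` is a hypothetical helper name; neither comes from this thread.

```python
import urllib.request

# Assumed location of the Solr update handler; adjust host, port and path
# to match your own installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text: str, url: str = SOLR_UPDATE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST whose body and header agree on UTF-8."""
    body = xml_text.encode("utf-8")  # the bytes on the wire are UTF-8
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=UTF-8"},
        method="POST",
    )


doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<add><doc><field name="question_title">Vedhæft billede</field></doc></add>'
)
request = build_update_request(doc)

# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes it easy to check the header and body bytes before anything reaches the server.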