Posted to solr-user@lucene.apache.org by Morten Fangel <fa...@sevengoslings.net> on 2007/03/10 21:01:54 UTC
Adding data as UTF-8
Hi,
I've been working on adding some Solr-integration into my current project, but
have run into a problem with non-ascii characters.
I send a document like the following:
---
<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
<field name="question_id">228</field>
<field name="question_title">Vedhæft billede til min formular</field>
<field name="userid">26</field>
<field name="question_text">Jeg har lavet en side som skal info om
værkstedet Badsetuen i Odense, som er under kraftig omlægning af kommunen -
dvs nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om
deres håndværk udført på stedet.
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/
Nogle ideer ?</field>
<field name="question_date">2006-05-17T08:44:23Z</field>
<field name="question_tags">Upload</field>
<field name="question_tags">HTML</field>
<field name="question_tags">Email</field>
<field name="question_tags">Vedhæftning</field>
</doc></add>
---
But when I do a search like "/solr/select/?q=billede" (default search is the
field "text" which is a multiValued copyField from question_title and
question_text)
I will get the document back as
---
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
...
</lst>
<result name="response" numFound="1" start="0">
<doc>
<date name="question_date">2006-05-17T08:44:23Z</date>
<int name="question_id">228</int>
<arr name="question_tags"><str>Upload</str><str>HTML</str><str>Email</str>
<str>Vedhæftning</str></arr>
<str name="question_text">Jeg har lavet en side som skal info om værkstedet
Badsetuen i Odense, som er under kraftig omlægning af kommunen - dvs
nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om
deres håndværk udført på stedet.
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/
Nogle ideer ?</str>
<str name="question_title">Vedhæft billede til min formular</str>
<str name="userid">26</str>
</doc>
</result>
</response>
---
Which is basically the same text, but displayed as ISO-8859-1. How can this be?
Do I have to send off some header saying it is UTF-8, or should I just send
the data as UTF-8 (that produces the correct encoding in answers, but sounds
like a silly way of doing it)?
Any ideas?
Btw, the install script listed at http://wiki.apache.org/solr/SolrTomcat is a
bit wrong. Should I just contribute the fixes (new solr dir and name to
fetch) to the wiki, or would any of you rather do it yourselves?
Regards
-fangel
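
For readers following along, the symptom described above is what typically happens when UTF-8 bytes are decoded as ISO-8859-1 somewhere between client and index. The following is only a minimal sketch under that assumption (the thread does not show the actual garbled output); it reuses the title from the posted document.

```python
# Assumption: the receiving side decoded the UTF-8 request body as ISO-8859-1.
title = "Vedhæft billede til min formular"

# The client sends the text as UTF-8 bytes ("æ" becomes the two bytes C3 A6).
utf8_bytes = title.encode("utf-8")

# A receiver that wrongly assumes ISO-8859-1 turns each byte into one character.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # VedhÃ¦ft billede til min formular

# The damage is reversible as long as nothing else touched the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == title)  # True
```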
Re: Adding data as UTF-8
Posted by Bertrand Delacretaz <bd...@apache.org>.
On 3/10/07, Walter Underwood <wu...@netflix.com> wrote:
> If it does something different, that is a bug. RFC 3023 is clear. --wunder..
Sure - just wanted to confirm what I'm seeing, thanks!
-Bertrand
Re: Adding data as UTF-8
Posted by Walter Underwood <wu...@netflix.com>.
If it does something different, that is a bug. RFC 3023 is clear. --wunder
On 3/10/07 1:49 PM, "Bertrand Delacretaz" <bd...@apache.org> wrote:
> On 3/10/07, Walter Underwood <wu...@netflix.com> wrote:
>> It is better to use "application/xml". See RFC 3023.
>> Using "text/xml; charset=UTF-8" will override the XML
>> encoding declaration. "application/xml" will not...
>
> I agree, but did you try this with our example setup, started with
> "java -jar start.jar"?
>
> It doesn't seem to work here: If I change our example/exampledocs/post.sh to
> use
>
> curl $URL --data-binary @$f -H 'Content-type:application/xml'
>
> instead of
>
> curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
>
> the encoding declaration of my posted XML is ignored, characters are
> interpreted according to my JVM encoding (-Dfile.encoding makes a
> difference in that case).
>
> Are you seeing something different, or do you know why this is so?
>
> -Bertrand
Re: Adding data as UTF-8
Posted by Bertrand Delacretaz <bd...@apache.org>.
On 3/10/07, Walter Underwood <wu...@netflix.com> wrote:
> It is better to use "application/xml". See RFC 3023.
> Using "text/xml; charset=UTF-8" will override the XML
> encoding declaration. "application/xml" will not...
I agree, but did you try this with our example setup, started with
"java -jar start.jar"?
It doesn't seem to work here: If I change our example/exampledocs/post.sh to use
curl $URL --data-binary @$f -H 'Content-type:application/xml'
instead of
curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
the encoding declaration of my posted XML is ignored, characters are
interpreted according to my JVM encoding (-Dfile.encoding makes a
difference in that case).
Are you seeing something different, or do you know why this is so?
-Bertrand
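
The difference discussed here can be illustrated outside Solr. The sketch below is not Solr's code; it only uses Python's standard XML parser to show the two behaviours: a parser given raw bytes reads the encoding declaration itself (what RFC 3023 expects for "application/xml" without a charset), while decoding the bytes with an unrelated default charset first loses the character. The sample document and the choice of UTF-8 as the "wrong" default are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A document that declares ISO-8859-1 and is really encoded that way.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><field>Vedhæft</field>'
raw = doc.encode("iso-8859-1")

# Case 1: hand the raw bytes to the parser; it honours the XML declaration.
from_bytes = ET.fromstring(raw).text
print(from_bytes == "Vedhæft")  # True

# Case 2: decode with some other default charset before parsing
# (comparable to relying on the JVM's file.encoding). The "æ" byte is
# not valid UTF-8 here, so it is replaced and the original text is lost.
wrongly_decoded = raw.decode("utf-8", errors="replace")
print("Vedhæft" in wrongly_decoded)  # False
```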
Re: Adding data as UTF-8
Posted by Morten Fangel <fa...@sevengoslings.net>.
On Saturday 10 March 2007 21:39, Bertrand Delacretaz wrote:
> On 3/10/07, Morten Fangel <fa...@sevengoslings.net> wrote:
> > ...I send a document like the following:
> >
> > ---
> > <?xml version="1.0" encoding="UTF-8"?>...
>
> I assume you're using your own code to "send" the document?
Indeed. Solr will be integrated (almost) transparently into my framework.. ;)
It'll work pretty much like the acts_as_solr RoR implementation, if I'm not
totally mistaken about that particular implementation..
>
> Currently you need to include a "Content-type: text/xml;
> charset=UTF-8" header in your HTTP POST request, and (as you're doing)
> the XML needs to be encoded in UTF-8.
Super. Indeed that fixed it, yes...
-fangel
Re: Adding data as UTF-8
Posted by Morten Fangel <fa...@sevengoslings.net>.
On Saturday 10 March 2007 22:18, Walter Underwood wrote:
> It is better to use "application/xml". See RFC 3023.
> Using "text/xml; charset=UTF-8" will override the XML
> encoding declaration. "application/xml" will not.
Thanks for the info. I've changed the header accordingly.
-fangel
Re: Adding data as UTF-8
Posted by Walter Underwood <wu...@netflix.com>.
It is better to use "application/xml". See RFC 3023.
Using "text/xml; charset=UTF-8" will override the XML
encoding declaration. "application/xml" will not.
wunder
On 3/10/07 12:39 PM, "Bertrand Delacretaz" <bd...@apache.org> wrote:
> On 3/10/07, Morten Fangel <fa...@sevengoslings.net> wrote:
>
>> ...I send a document like the following:
>>
>> ---
>> <?xml version="1.0" encoding="UTF-8"?>...
>
> I assume you're using your own code to "send" the document?
>
> Currently you need to include a "Content-type: text/xml;
> charset=UTF-8" header in your HTTP POST request, and (as you're doing)
> the XML needs to be encoded in UTF-8.
>
> See the source code of
> src/java/org/apache/solr/util/SimplePostTool.java for example.
>
> -Bertrand
Re: Adding data as UTF-8
Posted by Bertrand Delacretaz <bd...@apache.org>.
On 3/10/07, Morten Fangel <fa...@sevengoslings.net> wrote:
> ...I send a document like the following:
>
> ---
> <?xml version="1.0" encoding="UTF-8"?>...
I assume you're using your own code to "send" the document?
Currently you need to include a "Content-type: text/xml;
charset=UTF-8" header in your HTTP POST request, and (as you're doing)
the XML needs to be encoded in UTF-8.
See the source code of
src/java/org/apache/solr/util/SimplePostTool.java for example.
-Bertrand
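
For client code written in something other than Java, the advice above could look roughly like the following Python sketch. It is not taken from SimplePostTool; the helper name `build_update_request` and the update URL are assumptions for illustration (the URL matches the usual example setup and will differ under Tomcat). The request is only built here, not sent.

```python
import urllib.request

# Assumption: update handler URL of a local example install; adjust as needed.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text, url=SOLR_UPDATE_URL):
    """Build a POST request whose body and header agree on UTF-8."""
    body = xml_text.encode("utf-8")  # the bytes on the wire are UTF-8
    return urllib.request.Request(
        url,
        data=body,
        # The header recommended in the reply above.
        headers={"Content-Type": "text/xml; charset=UTF-8"},
        method="POST",
    )


doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<add><doc><field name="question_title">Vedhæft billede</field></doc></add>'
)
req = build_update_request(doc)

# To actually send it (requires a running Solr instance):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Keeping the encoding step and the header in one helper makes it harder for the two to drift apart, which is the mismatch that caused the problem in this thread.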