Posted to solr-dev@lucene.apache.org by Yonik Seeley <yo...@apache.org> on 2007/02/01 05:58:02 UTC

charset in POST from browser

It seems that browsers do a form POST in the charset that the page was
encoded in.
Modifying form.jsp in solr/admin seems to work... the data comes
across encoded in UTF-8.

The problem is that the charset isn't defined to be UTF-8 in the
headers, so the bytes are assumed to be latin-1.

Is this a problem we can fix in solr, or is it purely container config?

This will mimic what the browser sends back:
curl -i http://localhost:8983/solr/select -d 'q=%C3%AA'
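
A minimal Python sketch of why that request body is ambiguous: the same two bytes decode to different text depending on which charset the receiver assumes. The helper name decode_param is made up for illustration; this is not Solr or container code.

```python
from urllib.parse import unquote_to_bytes


def decode_param(encoded: str, charset: str) -> str:
    """Percent-decode a form value to raw bytes, then interpret the bytes in `charset`."""
    raw = unquote_to_bytes(encoded)  # "%C3%AA" becomes the two bytes 0xC3 0xAA
    return raw.decode(charset)


if __name__ == "__main__":
    # Same bytes, two interpretations: one character as UTF-8, two as latin-1.
    for charset in ("utf-8", "latin-1"):
        print(charset, ascii(decode_param("%C3%AA", charset)))
```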

-Yonik

Re: charset in POST from browser

Posted by Chris Hostetter <ho...@fucit.org>.
: Other things might use POST for querying though.  Perhaps they can all
: set a charset while doing so.

well, i can think of a couple of scenarios...

1) POST multipart/* to either /select or the new style URLs ...  the
browsers should put a content-type with a charset on each part; the
ContentStream parsing code Ryan wrote should do the right thing, we only
have to rely on the Servlet Container to do the right thing for the parts
containing servlet request params -- hopefully they use the charset
properly.

2) POST application/x-www-form-urlencoded to new style urls ... see below.

3) POST anything else to the new style urls ... parsed as a raw
ContentStream, charset taken from the content-type -- should work fine.

4) POST application/x-www-form-urlencoded to the current /select ... see
below.

5) POST */* to the /update ... it currently ignores content type and
assumes UTF-8 regardless of servlet container config ... we could
theoretically make it look at the content-type only for the charset and
still ignore the meat of the content-type.

6) GET anything ... see below.

"see below" is a situation where i don't think we can glean anything from
the request itself -- we have to make an assumption based on config.  for
#2 and #4 we could conceivably have a solrconfig.xml option indicating
what charset Solr should assume, and then we can (apparently) use
HttpServletRequest.setCharacterEncoding to specify that's the charset we
want the servlet container to use when parsing the input -- but i don't
think this helps case #6 -- i can't find any portable way to tell the
servlet container how to parse the URL, so if we have to rely on
documentation to instruct people on how to deal with that, we might as
well do the same thing for #2 and #4 (let it be in the servlet container
config instead of the solrconfig)
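
The fallback discussed here (use the charset from the request when one is declared, otherwise assume a configured value) can be sketched outside the servlet API. This is a Python illustration only, not servlet or Solr code; parse_form and charset_from_content_type are hypothetical names, and assumed_charset stands in for the proposed config option.

```python
from email.message import Message
from typing import Dict, List, Optional
from urllib.parse import parse_qs


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, or None if absent."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased charset, or None when not declared


def parse_form(body: str, content_type: Optional[str],
               assumed_charset: str = "utf-8") -> Dict[str, List[str]]:
    """Decode a urlencoded body, trusting a declared charset over the assumed one."""
    charset = charset_from_content_type(content_type) or assumed_charset
    return parse_qs(body, encoding=charset)
```

With no charset in the header, parse_form("q=%C3%AA", "application/x-www-form-urlencoded") decodes the value as UTF-8; passing assumed_charset="latin-1" gives the latin-1 interpretation instead. Nothing comparable applies to case #6, since a GET carries no Content-Type for the query string.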

(we should of course test all of these scenarios ... i'm just guessing #1,
#3 and #5 all work okay)



-Hoss


Re: charset in POST from browser

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Chris Hostetter <ho...@fucit.org> wrote:
> actually ... all of the existing forms we have are GET -- so it's kind of
> a moot issue isn't it?

Other things might use POST for querying though.  Perhaps they can all
set a charset while doing so.

> Did you see my other comments from what seemed to be a resin FAQ that
> mentioned "The character-encoding tag in the resin.conf."

I had tried url-character-encoding (which i shouldn't have to do
because it says it defaults to UTF-8):

   url-character-encoding
      Defines the character encoding to be used for decoding the URL.
      Because the HTTP protocol does not specify the encoding to be used,
      the server must specify the encoding beforehand.
   Default: utf-8

> sounds like that's what we should recommend to people using Resin ... i
> suspect they wouldn't even *have* to use UTF-8 .. they just have to set it
> to whatever encoding they want to use when POSTing queries.
> if setting character-encoding in the <web-app> tag works for URL encoded
> values, putting this in the resin.conf will probably work for that too.
>
>
>
> -Hoss

Re: charset in POST from browser

Posted by Chris Hostetter <ho...@fucit.org>.
: >    Content-type: application/x-www-form-urlencoded; charset=utf-8
: >
: > ...picking the charset based on the charset of the page containing the
: > form  (i assume you tested and verified this isn't happening?)
:
: Yep, Firefox 2.
: I'd serve the page, do a search, kill the solr server, run nc -l -p
: 8983, and run the search again.  The body was encoded correctly, but
: just no charset info.

yeah ... the google cache of
"ppewww.physics.gla.ac.uk/~flavell/charset/form-i18n.html" (URL
currently 403) suggests that browsers don't do this because a lot of old
CGI parsing libraries can't handle it.  RFC 2070 section 5.2 suggests that
this is one method that can be used -- but says "The best solution is to
use the "multipart/form-data" media type" ... perhaps if we change the
forms to use that explicitly things would work.

actually ... all of the existing forms we have are GET -- so it's kind of
a moot issue isn't it?  (i see there's a separate thread about
resin and UTF-8 in URLs - multipart/form-data wouldn't really help in that
case.)


Did you see my other comments from what seemed to be a resin FAQ that
mentioned "The character-encoding tag in the resin.conf." ... it
sounds like that's what we should recommend to people using Resin ... i
suspect they wouldn't even *have* to use UTF-8 .. they just have to set it
to whatever encoding they want to use when POSTing queries.

if setting character-encoding in the <web-app> tag works for URL encoded
values, putting this in the resin.conf will probably work for that too.



-Hoss


Re: charset in POST from browser

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Chris Hostetter <ho...@fucit.org> wrote:
> : The form that gets sent to the browser is in UTF-8, and the browser
> : correctly sends back UTF-8 in the post body.  *But* the browser doesn't
> : tell the container what the charset of the body is, so it's up to the
> : container to guess.  By default, resin seems to pick latin-1.
>
> That's really weird ... i could have sworn browsers doing POST of form
> data were supposed to send a full content-type...
>
>    Content-type: application/x-www-form-urlencoded; charset=utf-8
>
> ...picking the charset based on the charset of the page containing the
> form  (i assume you tested and verified this isn't happening?)

Yep, Firefox 2.
I'd serve the page, do a search, kill the solr server, run nc -l -p
8983, and run the search again.  The body was encoded correctly, but
just no charset info.

I tried setting it explicitly by appending a charset to enctype in the
form, but it doesn't go through.

-Yonik

Re: charset in POST from browser

Posted by Chris Hostetter <ho...@fucit.org>.
: The form that gets sent to the browser is in UTF-8, and the browser
: correctly sends back UTF-8 in the post body.  *But* the browser doesn't
: tell the container what the charset of the body is, so it's up to the
: container to guess.  By default, resin seems to pick latin-1.

That's really weird ... i could have sworn browsers doing POST of form
data were supposed to send a full content-type...

   Content-type: application/x-www-form-urlencoded; charset=utf-8

...picking the charset based on the charset of the page containing the
form  (i assume you tested and verified this isn't happening?)

a quick google search turned up this page, with this info...

http://www.systemvikar.biz/faq/servlet.xtp



Form character encoding doesn't work

A POST request with application/x-www-form-urlencoded doesn't contain any
information about the character request. So Resin needs to use a set of
heuristics to decode the form. Here's the order:

   1. request.getAttribute("caucho.form.character.encoding")
   2. The response.setContentType() encoding of the page.
   3. The character-encoding tag in the resin.conf.

Resin uses the default character encoding of your JVM to read form data.
To set the encoding to another charset, you'll need to change the
resin.conf as follows:

<http-server character-encoding='Shift_JIS'>
  ...
</http-server>
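
The lookup order in that FAQ entry amounts to a first-match fallback chain. A rough Python model follows, assuming the JVM default is what applies when none of the three sources is set (the FAQ text is not explicit about this). The function and parameter names are invented for illustration and are not Resin APIs.

```python
from typing import Optional


def pick_form_encoding(request_attribute: Optional[str],
                       response_encoding: Optional[str],
                       configured_encoding: Optional[str],
                       jvm_default: str) -> str:
    """Return the first encoding that is set, checking sources in the FAQ's order."""
    candidates = (
        request_attribute,    # 1. the "caucho.form.character.encoding" request attribute
        response_encoding,    # 2. encoding from response.setContentType() for the page
        configured_encoding,  # 3. character-encoding from resin.conf
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return jvm_default  # assumed last resort: the JVM's default encoding
```

Under this model, a page served with a UTF-8 content type would win over the resin.conf setting, which may be why the JSP page directive was expected to be enough.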



Re: charset in POST from browser

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Chris Hostetter <ho...@fucit.org> wrote:
> : The problem is that the charset isn't defined to be UTF-8 in the
> : headers, so the bytes are assumed to be latin-1.
> :
> : Is this a problem we can fix in solr, or is it purely container config?
>
> umm... we already fixed this the best way i know how in SOLR-35 ... all of
> the JSPs that have forms should have this in them...
>
> <%@ page contentType="text/html; charset=utf-8" pageEncoding="UTF-8"%>
>
> ...is resin not respecting that?

The form that gets sent to the browser is in UTF-8, and the browser
correctly sends back UTF-8 in the post body.  *But* the browser doesn't
tell the container what the charset of the body is, so it's up to the
container to guess.  By default, resin seems to pick latin-1.

It seems like we should assume UTF-8 if no charset is sent for a text
content type.

-Yonik

Re: charset in POST from browser

Posted by Chris Hostetter <ho...@fucit.org>.
: The problem is that the charset isn't defined to be UTF-8 in the
: headers, so the bytes are assumed to be latin-1.
:
: Is this a problem we can fix in solr, or is it purely container config?

umm... we already fixed this the best way i know how in SOLR-35 ... all of
the JSPs that have forms should have this in them...

<%@ page contentType="text/html; charset=utf-8" pageEncoding="UTF-8"%>

...is resin not respecting that?




-Hoss