Posted to solr-dev@lucene.apache.org by Yonik Seeley <yo...@apache.org> on 2007/02/01 04:48:47 UTC

resin and UTF-8 in URLs

So, we've conquered UTF-8 input in URLs for Jetty and Tomcat, so how
about Resin?

Right now, I can't get Resin 3.0.22 to see an e with a circumflex via
the following:

curl -i 'http://localhost:8983/solr/select?q=%C3%AA&echoParams=explicit'

-Yonik
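
For reference, a minimal sketch of what is at stake in that curl command: the same percent-encoded bytes decode to one character under UTF-8 and to two under latin1. It uses only java.net.URLDecoder from the JDK; the class name PercentDecodeDemo and the helper decode are made up for illustration and are not part of Solr or Resin.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not Solr or Resin code.
final class PercentDecodeDemo {
    // Decodes a percent-encoded query value using the named charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%C3%AA"; // the q parameter from the curl command above
        String asUtf8 = decode(raw, "UTF-8");        // U+00EA, e with circumflex
        String asLatin1 = decode(raw, "ISO-8859-1"); // U+00C3 followed by U+00AA
        System.out.println(asUtf8.length());   // 1
        System.out.println(asLatin1.length()); // 2
    }
}
```

A container that decodes the query string as latin1 would hand the application the two-character form.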

Re: resin and UTF-8 in URLs

Posted by Chris Hostetter <ho...@fucit.org>.
: I am only suggesting it for GET requests where the params are pulled
: off the query string.  Apparently, UTF-8 is the *only* ok URL encoding
:
: http://www.w3.org/International/O-URL-code.html
:
: It is strange that resin and tomcat don't observe this unless it is
: specified as the default encoding.  If it can't hurt anything, i think
: it's a good idea for solr.

my point is that we should not be trying to work around any bugs / bad
defaults in the servlet containers -- especially if the only way to do it
means ignoring a user who explicitly tells us what charset they want to
use.  At the moment it might not seem like it can hurt anything, but it
might cause problems we haven't thought of (if not now, then in the
future) and it doesn't actually "fix" any bug in Solr -- if it's not our
bug, we should just document it as an issue some servlet containers have.

if nothing else it might contribute to a user's choice as to what servlet
container to use -- it's not our job to shield users from bad servlet
container implementations.



-Hoss


Re: resin and UTF-8 in URLs

Posted by Ryan McKinley <ry...@gmail.com>.
>
> : If we can do something small that makes the most normal cases work
> : even if the container is not configured, that seems good.
>
> but how do we know the user wants what we consider "normal cases" to
> work? ... if every servlet container lets you configure your default
> charset differently, we have no easy way to tell if/when they've
> configured the default properly, to know if we should override it.
>

I am only suggesting it for GET requests where the params are pulled
off the query string.  Apparently, UTF-8 is the *only* ok URL encoding

http://www.w3.org/International/O-URL-code.html

It is strange that resin and tomcat don't observe this unless it is
specified as the default encoding.  If it can't hurt anything, i think
it's a good idea for solr.

>
>
> : At the very least, we should change the examples in:
> : http://wiki.apache.org/solr/SolrResin etc
>
> oh absolutely.
>

done

Re: resin and UTF-8 in URLs

Posted by Yonik Seeley <yo...@apache.org>.
Some standalone tests for charset handling would be nice... something
that we could use to test the major servlet containers w/ Solr before
finalizing a release.

If someone is having problems with international chars, they could
also run the tests against their particular server.

-Yonik
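
A rough sketch of the client side of such a test, assuming the probe is a select request with echoParams=explicit like the curl command that opened this thread. The class name CharsetProbe and both helpers are hypothetical; fetching the URL and decoding the response body are left to the caller.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for a container charset check; not part of Solr.
final class CharsetProbe {
    // Builds a select URL whose q parameter is the UTF-8 percent-encoding of text.
    static String probeUrl(String baseUrl, String text) {
        try {
            return baseUrl + "/select?q=" + URLEncoder.encode(text, "UTF-8")
                    + "&echoParams=explicit";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    // True when the (already decoded) response body echoes the text unchanged.
    static boolean echoedIntact(String responseBody, String text) {
        return responseBody.contains(text);
    }
}
```

Running the probe against each container with a character outside latin1's ASCII range would show whether the query string was decoded as UTF-8.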

Re: resin and UTF-8 in URLs

Posted by Chris Hostetter <ho...@fucit.org>.
: For XML, I think trusting the XML parser, and not the servlet
: container is a better way to go.
: That means handing the XML parser an InputStream instead of a Reader.

you mean if there is no charset in the content-type? ... yeah, that was
what i (think i) was suggesting as far as XML goes, trust the user.

: There *is* one place I think we should use UTF-8 when there isn't a
: charset specified:
: a POST with "Content-Type: application/x-www-form-urlencoded".
:
: a) You can't get browsers to put a charset there.
: b) Browsers by default encode the form data in the charset of the form.
: c) We know more than the servlet container in this instance... we know
:    at least that our admin pages use UTF-8, and that a POST coming from
:    them will be UTF-8.

Hmmm ... okay i guess i can get behind that.  Can we at least agree that
if the client *does* specify a charset in the content-type header we'll
use it? ... browsers may not be doing it, but client libraries can.



-Hoss
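
One way to read the rule agreed on above (honor a charset in the Content-Type header when the client sends one, otherwise fall back to a default such as UTF-8 for form posts) is sketched below. The class name FormCharset and the method resolve are hypothetical, and the header parsing is deliberately simple; it is an illustration, not what Solr does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper; not Solr code.
final class FormCharset {
    // Returns the charset named in a Content-Type header value, or the
    // supplied fallback when the header carries no charset parameter.
    static Charset resolve(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1); // strip quotes
                }
                return Charset.forName(name); // throws on unknown names
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // A browser form post: no charset parameter, so the fallback applies.
        System.out.println(resolve("application/x-www-form-urlencoded",
                StandardCharsets.UTF_8));
    }
}
```

The fallback is a parameter rather than a constant so the caller decides whether it comes from the container, from config, or from a hardcoded UTF-8.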


Re: resin and UTF-8 in URLs

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Chris Hostetter <ho...@fucit.org> wrote:
> ...the only real question in my mind is what to do if user supplied data
> has *NO* charset information of any kind ... for XML the spec seems very
> clear that in that case you test for UTF-8 or UTF-16 ... but for arbitrary
> streams of character data in other formats (CSV, JSON, etc...) it seems
> like trusting the servlet container to tell us the default encoding is the
> right way to go.

For XML, I think trusting the XML parser, and not the servlet
container is a better way to go.
That means handing the XML parser an InputStream instead of a Reader.

There *is* one place I think we should use UTF-8 when there isn't a
charset specified:
a POST with "Content-Type: application/x-www-form-urlencoded".

a) You can't get browsers to put a charset there.
b) Browsers by default encode the form data in the charset of the form.
c) We know more than the servlet container in this instance... we know
   at least that our admin pages use UTF-8, and that a POST coming from
   them will be UTF-8.

-Yonik
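
A small sketch of the InputStream point, using the JDK's own DOM parser: given raw bytes, the parser reads the encoding from the XML declaration itself. The class name XmlStreamDemo and the method rootText are made up for the example; nothing here is taken from Solr's update handler.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class; not Solr code.
final class XmlStreamDemo {
    // Parses raw bytes; the parser picks the encoding from the BOM or the
    // XML declaration because it sees bytes rather than decoded characters.
    static String rootText(byte[] xmlBytes) {
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version='1.0' encoding='ISO-8859-1'?><q>\u00ea</q>";
        // In latin1 the e with circumflex is the single byte 0xEA.
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(latin1Bytes).equals("\u00ea"));
    }
}
```

Had the same bytes been wrapped in a Reader with the wrong charset first, the parser would never see the original bytes and the declaration could not help.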

Re: resin and UTF-8 in URLs

Posted by Walter Underwood <wu...@netflix.com>.
On 2/1/07 6:00 PM, "Chris Hostetter" <ho...@fucit.org> wrote:

> That may be, but Solr was only publicly available for 9 months before we
> had someone running into confusion because they were trying to post an XML
> file that wasn't UTF-8 :)
> 
>     http://www.nabble.com/wana-use-CJKAnalyzer-tf2303256.html#a6498685

But that file wasn't a legal XML file in a non-standard encoding,
it was an illegal XML file in UTF-8. I don't think we're planning
on repairing broken XML automatically.

wunder


Re: resin and UTF-8 in URLs

Posted by Chris Hostetter <ho...@fucit.org>.
: The XML spec says that XML parsers are only required to support
: UTF-8, UTF-16, ISO 8859-1, and US-ASCII. If you use a different
: encoding for XML, there is no guarantee that a conforming parser
: will accept it.

there may not be a guarantee -- but shouldn't we at least try to respect
the client's wishes?

: Ultraseek has been indexing XML for the past nine years, and
: I remember a single customer that had XML in a non-standard
: encoding. Effectively all real-world XML is in one of the
: standard encodings.

That may be, but Solr was only publicly available for 9 months before we
had someone running into confusion because they were trying to post an XML
file that wasn't UTF-8 :)

    http://www.nabble.com/wana-use-CJKAnalyzer-tf2303256.html#a6498685

: The right spec for XML over HTTP is RFC 3023. For text/xml
: with no charset spec, the XML must be interpreted as US-ASCII.

I can go along with that ... if there is a specification for a file format
that says which charset should be assumed if it can't be determined then i
agree, that's a case where it makes sense to hardcode "UTF-8" or
"US-ASCII" in Solr ... but that's not justification for using something
like request.setCharacterEncoding("UTF-8") in the SolrDispatcher where it
applies to everything -- it's a justification for hardcoding a default of
US-ASCII or UTF-8 in the XmlUpdateRequestHandler.

as a general rule, it seems like trusting the ServletContainer for the
default is the right thing to do.





-Hoss


Re: resin and UTF-8 in URLs

Posted by Walter Underwood <wu...@netflix.com>.
On 2/1/07 3:18 PM, "Chris Hostetter" <ho...@fucit.org> wrote:
>
> As for XML, or any other format a user might POST to solr (or ask solr
> to fetch from a remote source) what possible reason would we have for only
> supporting UTF-8? .. why do you suggest that the XML standard "specify
> UTF-8, [so] we should use UTF-8" ... doesn't the XML standard say we
> should use the charset specified in the content-type if there is one, and
> if not use the encoding specified in the xml header, ie...
> 
> <?xml version='1.0' encoding='EUC-JP'?>

The XML spec says that XML parsers are only required to support
UTF-8, UTF-16, ISO 8859-1, and US-ASCII. If you use a different
encoding for XML, there is no guarantee that a conforming parser
will accept it.

Ultraseek has been indexing XML for the past nine years, and
I remember a single customer that had XML in a non-standard
encoding. Effectively all real-world XML is in one of the
standard encodings.

The right spec for XML over HTTP is RFC 3023. For text/xml
with no charset spec, the XML must be interpreted as US-ASCII.
From section 8.5:

   Omitting the charset parameter is NOT RECOMMENDED for text/xml.  For
   example, even if the contents of the XML MIME entity are UTF-16 or
   UTF-8, or the XML MIME entity has an explicit encoding declaration,
   XML and MIME processors MUST assume the charset is "us-ascii".

wunder
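
To make the consequence of that passage concrete, here is a sketch of a receiver that applies the quoted rule literally to a text/xml body. The class name TextXmlCharset and its methods are hypothetical, and the charset parameter is assumed to have been extracted from the Content-Type header already.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the RFC 3023 section 8.5 rule quoted above.
final class TextXmlCharset {
    // text/xml with no charset parameter is us-ascii, whatever the
    // XML declaration inside the body says.
    static Charset forTextXml(String charsetParam) {
        return charsetParam == null
                ? StandardCharsets.US_ASCII
                : Charset.forName(charsetParam);
    }

    // Decodes a text/xml body the way a strictly conforming receiver would.
    static String decodeBody(byte[] body, String charsetParam) {
        return new String(body, forTextXml(charsetParam));
    }

    public static void main(String[] args) {
        byte[] body = "<q>\u00ea</q>".getBytes(StandardCharsets.UTF_8);
        // With charset=utf-8 the text survives; with no charset the two
        // non-ASCII bytes become U+FFFD replacement characters.
        System.out.println(decodeBody(body, "utf-8").equals("<q>\u00ea</q>"));
        System.out.println(decodeBody(body, null).contains("\uFFFD"));
    }
}
```

Under this reading, any non-ASCII document sent as text/xml without a charset parameter cannot be decoded faithfully.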



Re: resin and UTF-8 in URLs

Posted by Chris Hostetter <ho...@fucit.org>.
: > Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it
: > anywhere -- not even in the example config: new users shouldn't need to
: > know about any special solrconfig options that must be (un)set to get
: > Solr to use their servlet container / system default charset.
:
: I strongly disagree. When we use standards like URIs and XML which
: specify UTF-8, we should use UTF-8.

I'm confused:  As far as URI/URLs go, Solr isn't the one decoding them,
and as I said: nothing in the servlet spec suggests that an app has any
say over how the servlet container will decode them, presumably because
they *must* be UTF-8 ... so this is not our problem, and we shouldn't go out
of our way to try and force the servlet container to treat the URLs as
utf8.

As for XML, or any other format a user might POST to solr (or ask solr
to fetch from a remote source) what possible reason would we have for only
supporting UTF-8? .. why do you suggest that the XML standard "specify
UTF-8, [so] we should use UTF-8" ... doesn't the XML standard say we
should use the charset specified in the content-type if there is one, and
if not use the encoding specified in the xml header, ie...

	<?xml version='1.0' encoding='EUC-JP'?>

...the only real question in my mind is what to do if user supplied data
has *NO* charset information of any kind ... for XML the spec seems very
clear that in that case you test for UTF-8 or UTF-16 ... but for arbitrary
streams of character data in other formats (CSV, JSON, etc...) it seems
like trusting the servlet container to tell us the default encoding is the
right way to go.



-Hoss


Re: resin and UTF-8 in URLs

Posted by Walter Underwood <wu...@netflix.com>.
On 2/1/07 2:53 PM, "Chris Hostetter" <ho...@fucit.org> wrote:

> Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it
> anywhere -- not even in the example config: new users shouldn't need to
> know about any special solrconfig options that must be (un)set to get
> Solr to use their servlet container / system default charset.

I strongly disagree. When we use standards like URIs and XML which
specify UTF-8, we should use UTF-8.

If someone has intentionally set defaults which do not comply with
the standards, they can also do the extra work to make Solr behave
in a non-standard way.

I really cannot imagine a real use for that configuration, especially
in a back end server like Solr. In HTML, changing from Shift-JIS to
GB will change the shape of a few kanji characters, but there is
no need to store everything in GB or talk to the servers in GB.

wunder
 


Re: resin and UTF-8 in URLs

Posted by Chris Hostetter <ho...@fucit.org>.
: Let's not make this complicated for situations that we've never
: seen in practice. Java is a Unicode language and always has been.
: Anyone running a Java system with a Shift-JIS default should already
: know the pitfalls, and know them much better than us (and I know a
: lot about Shift-JIS).
:
: The URI spec says UTF-8, so we can be compliant and tell people
: to fix their code. If they need to add specific hacks for their
: broken software, that is OK. We don't need generic design features
: for a few broken clients.

i think the fact that yonik started two separate threads, one about GET
URLs and one about POST is confusing things ... the discussions have
merged and i'm trying to speak in generalities about dealing with all
input.

Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it
anywhere -- not even in the example config: new users shouldn't need to
know about any special solrconfig options that must be (un)set to get
Solr to use their servlet container / system default charset.


-Hoss


Re: resin and UTF-8 in URLs

Posted by Walter Underwood <wu...@netflix.com>.
Let's not make this complicated for situations that we've never
seen in practice. Java is a Unicode language and always has been.
Anyone running a Java system with a Shift-JIS default should already
know the pitfalls, and know them much better than us (and I know a
lot about Shift-JIS).

The URI spec says UTF-8, so we can be compliant and tell people
to fix their code. If they need to add specific hacks for their
broken software, that is OK. We don't need generic design features
for a few broken clients.

RFC 3986 has been out for two years now. That is long enough for
decently-maintained software to get it right.

wunder

On 2/1/07 2:14 PM, "Chris Hostetter" <ho...@fucit.org> wrote:

> 
> : If we can do something small that makes the most normal cases work
> : even if the container is not configured, that seems good.
> 
> but how do we know the user wants what we consider "normal cases" to
> work? ... if every servlet container lets you configure your default
> charset differently, we have no easy way to tell if/when they've
> configured the default properly, to know if we should override it.
> 
> If someone does everything in Shift-JIS, and sets up their servlet
> container with Shift-JIS as their default, and installs solr -- i don't
> want them to think Solr sucks because there is a default in Solr they
> don't know about (or know how to disable) that assumes UTF-8.
> 
> On the other hand: if someone really hasn't thought about charsets at all,
> then it doesn't seem that bad to use whatever default their servlet
> container says to use -- as I understand it some containers (tomcat
> included) pick their default based on the JVM's
> configuration (i assume from the "user.language" sysproperty) ... that
> certainly seems like a better default than for us to assume UTF-8 -- even
> if it is "latin1" for "en", because most novice users are probably okay
> with latin1 ... if you're starting to worry about more complex characters
> that aren't in the default charset your servlet container picks for you,
> then reading a little documentation is a good idea.
> 
> 
> : At the very least, we should change the examples in:
> : http://wiki.apache.org/solr/SolrResin etc
> 
> oh absolutely.
> 
> 
> 
> 
> -Hoss
> 


Re: resin and UTF-8 in URLs

Posted by Chris Hostetter <ho...@fucit.org>.
: If we can do something small that makes the most normal cases work
: even if the container is not configured, that seems good.

but how do we know the user wants what we consider "normal cases" to
work? ... if every servlet container lets you configure your default
charset differently, we have no easy way to tell if/when they've
configured the default properly, to know if we should override it.

If someone does everything in Shift-JIS, and sets up their servlet
container with Shift-JIS as their default, and installs solr -- i don't
want them to think Solr sucks because there is a default in Solr they
don't know about (or know how to disable) that assumes UTF-8.

On the other hand: if someone really hasn't thought about charsets at all,
then it doesn't seem that bad to use whatever default their servlet
container says to use -- as I understand it some containers (tomcat
included) pick their default based on the JVM's
configuration (i assume from the "user.language" sysproperty) ... that
certainly seems like a better default than for us to assume UTF-8 -- even
if it is "latin1" for "en", because most novice users are probably okay
with latin1 ... if you're starting to worry about more complex characters
that aren't in the default charset your servlet container picks for you,
then reading a little documentation is a good idea.


: At the very least, we should change the examples in:
: http://wiki.apache.org/solr/SolrResin etc

oh absolutely.




-Hoss


Re: resin and UTF-8 in URLs

Posted by Ryan McKinley <ry...@gmail.com>.
> it seems like every servlet container has some way of configuring the
> default, so we should just rely on that and not add our own default
>

I agree, except that in the world of first time (and even seasoned)
web-app/system developers/maintainers, we don't always set things up
properly! or even know how to set things up properly!

If we can do something small that makes the most normal cases work
even if the container is not configured, that seems good.

At the very least, we should change the examples in:
http://wiki.apache.org/solr/SolrResin etc

to use:

<web-app id="/solr" character-encoding="utf-8">
  <env-entry>
   ...

ryan

Re: resin and UTF-8 in URLs

Posted by Chris Hostetter <ho...@fucit.org>.
: > : >  request.setCharacterEncoding ("utf-8")

: > ...my reading of the servlet spec was that request.setCharacterEncoding
: > only impacted request *body* data, not the URL.

: > According to the javadocs for it, using it also means that if the client
: > is well behaved and *does* set a charset in the Content-Type it will be
: > ignored.
:
: Content-Type for a GET?

oh ... so you guys were suggesting we only call
request.setCharacterEncoding explicitly on a GET request?

hmmm...

1) in theory, a GET request can have Content-Type
2) as i said, i don't see anything in the servlet spec that says
request.setCharacterEncoding should have any effect on how the URL is
parsed -- servlet containers may use it that way, but the spec explicitly
says "body" of the request.

From what i've seen, URLs are always supposed to be UTF-8 (RFC 3986
apparently makes this crystal clear but i haven't read it to verify) which
is probably why the servlet spec only talks about the character encoding for
the body.  if a servlet container isn't doing that properly, then we
shouldn't try to work around it -- we should just document it on the wiki
pages for the various major servlet containers.


-Hoss


Re: resin and UTF-8 in URLs

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : > should we add:
> : >  request.setCharacterEncoding ("utf-8")
> : > to GET requests in StandardRequestParser?
> :
> : Perhaps.  I wonder if there's any performance impact, and if it fixes
> : Tomcat's default of latin1 too.
>
> see my comments in the related thread about POST...
>
> http://www.nabble.com/charset-in-POST-from-browser-tf3153057.html#a8744560
>
> ...my reading of the servlet spec was that request.setCharacterEncoding
> only impacted request *body* data, not the URL.

Yeah, hence I wouldn't do it if it only fixed resin, but if it fixed
tomcat too, it would save a lot of people headaches

> According to the javadocs for it, using it also means that if the client
> is well behaved and *does* set a charset in the Content-Type it will be
> ignored.

Content-Type for a GET?

> Solr users should be able to pick their encoding as much as possible -- so
> we definitely shouldn't do anything that overrides the charset specified
> in the request (if there is one)

Sure.

> but we also shouldn't hardcode UTF-8
> anywhere if possible ... the default charset should come from some config
> -- either the solrconfig or the servlet container's config.

The problem is that one needs to be an expert to figure all this crap out.

Defaulting to UTF-8 in a url-encoded POST (where browsers refuse to
add charset) seems like a good default, and one that will increase
interop and prevent people from getting backed into a corner later.

-Yonik

Re: resin and UTF-8 in URLs

Posted by Chris Hostetter <ho...@fucit.org>.
: > should we add:
: >  request.setCharacterEncoding ("utf-8")
: > to GET requests in StandardRequestParser?
:
: Perhaps.  I wonder if there's any performance impact, and if it fixes
: Tomcat's default of latin1 too.

see my comments in the related thread about POST...

http://www.nabble.com/charset-in-POST-from-browser-tf3153057.html#a8744560

...my reading of the servlet spec was that request.setCharacterEncoding
only impacted request *body* data, not the URL.

According to the javadocs for it, using it also means that if the client
is well behaved and *does* set a charset in the Content-Type it will be
ignored.

Solr users should be able to pick their encoding as much as possible -- so
we definitely shouldn't do anything that overrides the charset specified
in the request (if there is one) but we also shouldn't hardcode UTF-8
anywhere if possible ... the default charset should come from some config
-- either the solrconfig or the servlet container's config.

it seems like every servlet container has some way of configuring the
default, so we should just rely on that and not add our own default






:
: -Yonik
:



-Hoss


Re: resin and UTF-8 in URLs

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Ryan McKinley <ry...@gmail.com> wrote:
> should we add:
>  request.setCharacterEncoding ("utf-8")
> to GET requests in StandardRequestParser?

Perhaps.  I wonder if there's any performance impact, and if it fixes
Tomcat's default of latin1 too.

-Yonik

Re: resin and UTF-8 in URLs

Posted by Ryan McKinley <ry...@gmail.com>.
should we add:
 request.setCharacterEncoding ("utf-8")
to GET requests in StandardRequestParser?

Re: resin and UTF-8 in URLs

Posted by Yonik Seeley <yo...@apache.org>.
FYI, I talked to Caucho, and for params in the query string of a URI
they use the charset of the request (which defaults to latin1).  It
will likely be fixed in the 3.1 line.

They indicated that setting the charset before asking for the
parameters would also work:
request.setCharacterEncoding ("utf-8")

-Yonik

On 2/1/07, Yonik Seeley <yo...@apache.org> wrote:
> On 2/1/07, Ryan McKinley <ry...@gmail.com> wrote:
> > I just tried this on two systems... it worked on one (I got the ê) and
> > the other I get ê -- both running resin 3.0.21
>
> A co-worker informed me that adding a character-encoding attribute to
> the web-app tag in web.xml will force a charset if not defined.  Seems
> to work for both GET and POST.
>
> <web-app character-encoding="utf-8">
>
> This looks resin-specific though.
>
> -Yonik
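
The "before asking for the parameters" condition matters because containers typically parse the query string lazily, on the first parameter lookup. The toy model below illustrates that ordering; LazyQueryParams is a hypothetical stand-in written for this note, not Resin's implementation, and it keeps only one value per parameter name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a container's request object; not Resin's code.
final class LazyQueryParams {
    private final String rawQuery;
    private String charset = "ISO-8859-1"; // the latin1 default described above
    private Map<String, String> parsed;    // filled on first getParameter call

    LazyQueryParams(String rawQuery) {
        this.rawQuery = rawQuery;
    }

    // Only has an effect if called before the first getParameter call.
    void setCharacterEncoding(String charset) {
        if (parsed == null) {
            this.charset = charset;
        }
    }

    String getParameter(String name) {
        if (parsed == null) {
            parsed = parse();
        }
        return parsed.get(name);
    }

    // Splits the raw query on '&' and '=' and decodes with the current charset.
    private Map<String, String> parse() {
        Map<String, String> result = new HashMap<>();
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            try {
                result.put(URLDecoder.decode(key, charset),
                        URLDecoder.decode(value, charset));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("unsupported charset: " + charset, e);
            }
        }
        return result;
    }
}
```

In this model, calling setCharacterEncoding("UTF-8") first makes q=%C3%AA come back as a single character, while calling it after a lookup changes nothing.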

Re: resin and UTF-8 in URLs

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Ryan McKinley <ry...@gmail.com> wrote:
> I just tried this on two systems... it worked on one (I got the ê) and
> the other I get ê -- both running resin 3.0.21

A co-worker informed me that adding a character-encoding attribute to
the web-app tag in web.xml will force a charset if not defined.  Seems
to work for both GET and POST.

<web-app character-encoding="utf-8">

This looks resin-specific though.

-Yonik

Re: resin and UTF-8 in URLs

Posted by Ryan McKinley <ry...@gmail.com>.
I just tried this on two systems... it worked on one (I got the ê) and
the other I get ê -- both running resin 3.0.21

The one that works has http://securityfilter.sourceforge.net/ applied.
 I'll look into what securityfilter is doing... it may be setting
something explicitly