You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-dev@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2009/09/04 02:24:55 UTC

Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

: +61. SOLR-1091: Jetty's use of CESU-8 for code points outside the BMP
: +    resulted in invalid output from the serialized PHP writer. (yonik)

	...

: +  static boolean modifiedUTF8 = System.getProperty("jetty.home") != null;

...that seems really hackish to me, particularly since for all we know 
there are other servlet containers that might have the same problem.

Wouldn't it make more sense to add an init param for the 
PHPSerializedResponseWriter that lets people set this deterministicly in 
solrconfig.xml?  (and then we could set it and comment on it in the 
example configs so things are more transparent)

-Hoss


Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Donovan Jimenez <dj...@conduit-it.com>.
you are correct, it was my misunderstanding of the problem - now that  
I've read more than I ever wanted to know about UCS-2, UTF-16 and  
modified UTF-8, I'm more upto speed.

Thanks for the patience.

On Sep 11, 2009, at 6:32 PM, Yonik Seeley wrote:

> On Fri, Sep 11, 2009 at 6:21 PM, Donovan Jimenez
> <dj...@conduit-it.com> wrote:
>> Is it possible (and would it even help)  to normalize all strings  
>> with
>> regards to surrogate pairs at indexing time instead?
>
> Already done, in a way... there's only one way to represent a
> character outside the BMP in UTF-16 (which is the in-memory encoding
> used by Java String).  Unless I misunderstood what you meant by
> normalization.
>
> -Yonik
> http://www.lucidimagination.com


Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Sep 11, 2009 at 6:21 PM, Donovan Jimenez
<dj...@conduit-it.com> wrote:
> Is it possible (and would it even help)  to normalize all strings with
> regards to surrogate pairs at indexing time instead?

Already done, in a way... there's only one way to represent a
character outside the BMP in UTF-16 (which is the in-memory encoding
used by Java String).  Unless I misunderstood what you meant by
normalization.

-Yonik
http://www.lucidimagination.com

Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Donovan Jimenez <dj...@conduit-it.com>.
Is it possible (and would it even help)  to normalize all strings  
with regards to surrogate pairs at indexing time instead?  or will  
container still possibly differ in byte for byte output?

- Donovan

On Sep 11, 2009, at 5:34 PM, Chris Hostetter wrote:

>
> : > why don't we just output the raw bytes ourselves?
> :
> : That would require writing TextResponseWriter and friends as binary
> : writers, right?  Or did you have a different way in mind for  
> injecting
> : bytes into the output stream?
>
>
> Grr.... you're right. i got so turned arround thinking about
> counting the bytes and encapsulating it all in the PHPS markup i  
> forgot
> that we still want/need the *raw* bytes output as part of the  
> character
> stream ... not some sort of string representation of the bytes ... wow
> that sounds even more comfusing when i look at it on paper.
>
> character data sucks.
>
> I still repeat my earlier vote: let's yank this patch and just  
> document
> that byte counts for strings included in the PHPS format are only  
> accurate
> for characters outside the BMP if the servlet container being used  
> writes
> them correctly -- that seems like a totally fair requirement to  
> having in
> place, and points the figner squarly where in belongs (without  
> putting us
> at risk for having broken code if/when jetty fixes this problem.
>
> Alternately: have an option and/or system property to force/disable  
> this
> behavior even if jetty.home isn't/is set.
>
> Even if we change nothing else: there needs to be a big fat freaking
> disclaimer in the javadocs for the writer that it's looking at the
> jetty.home variable, and explaining why.
>
>
>
> -Hoss
>


Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Chris Hostetter <ho...@fucit.org>.
: > why don't we just output the raw bytes ourselves?
: 
: That would require writing TextResponseWriter and friends as binary
: writers, right?  Or did you have a different way in mind for injecting
: bytes into the output stream?


Grr.... you're right. i got so turned arround thinking about 
counting the bytes and encapsulating it all in the PHPS markup i forgot 
that we still want/need the *raw* bytes output as part of the character 
stream ... not some sort of string representation of the bytes ... wow 
that sounds even more comfusing when i look at it on paper.

character data sucks.

I still repeat my earlier vote: let's yank this patch and just document 
that byte counts for strings included in the PHPS format are only accurate 
for characters outside the BMP if the servlet container being used writes 
them correctly -- that seems like a totally fair requirement to having in 
place, and points the figner squarly where in belongs (without putting us 
at risk for having broken code if/when jetty fixes this problem.

Alternately: have an option and/or system property to force/disable this 
behavior even if jetty.home isn't/is set.

Even if we change nothing else: there needs to be a big fat freaking 
disclaimer in the javadocs for the writer that it's looking at the 
jetty.home variable, and explaining why.



-Hoss


Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Sep 11, 2009 at 5:06 PM, Chris Hostetter
<ho...@fucit.org> wrote:
> why don't we just output the raw bytes ourselves?

That would require writing TextResponseWriter and friends as binary
writers, right?  Or did you have a different way in mind for injecting
bytes into the output stream?

-Yonik
http://www.lucidimagination.com

Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Sep 11, 2009 at 5:06 PM, Chris Hostetter
<ho...@fucit.org> wrote:

> I must be missunderstanding something still ... based on your description,
> it sounds like it shouldn't matter if the encoder knows that it's one
> logical character or not, either way it should wind up outputing the same
> number of bytes....

yes, the # of bytes is different: 6 bytes versus 4 bytes in UTF-8

-- 
Robert Muir
rcmuir@gmail.com

Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Chris Hostetter <ho...@fucit.org>.
: A code point (unicode character) outside of the BMP (basic
: multilingual plane, fits in 16 bits) is represented as two java chars
: - a surrogate pair.  It's a single logical character - see
: String.codePointAt().  In correct UTF-8 it should be encoded as a
: single code point... but Jetty is ignoring the fact that it's a
: surrogate pair and encoding each Java char as it's own code point...
: this is often called modified-UTF8 or CESU-8.
: 
: So... say you have this incorrect CESU-8 that is masquerading as
: UTF-8: all is not lost.
	...

I must be missunderstanding something still ... based on your description, 
it sounds like it shouldn't matter if the encoder knows that it's one 
logical character or not, either way it should wind up outputing the same 
number of bytes....

Except that if that were the case, why would we have had this bug in the 
first place?  clearly i'm still missunderstanding something.

: Bottom line - if we correctly encapsulate whatever the servlet
: container is writing, it's certainly possible for clients to use the
: output correctly.

I still come back to not liking that this is a hardcoded hack just for 
jetty ... it seems like it would be easy for a future version of jetty to 
change this behavior in some way that makes solr start breaking -- jetty 
could fix this bug and break solr's byte counting ... that doesn't seem 
cool

why don't we just output the raw bytes ourselves?  the code to generate 
the byte[] was/is allready there, we're just ignoring it and only using 
the length.



-Hoss


Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Sep 8, 2009 at 7:46 PM, Chris Hostetter<ho...@fucit.org> wrote:
> The modifiedUTF8 boolean only influence the numeric length returned as the
> "s" option ... the actaully "val" string is still written "as is" by the
> servlet container.

Yep.

A code point (unicode character) outside of the BMP (basic
multilingual plane, fits in 16 bits) is represented as two java chars
- a surrogate pair.  It's a single logical character - see
String.codePointAt().  In correct UTF-8 it should be encoded as a
single code point... but Jetty is ignoring the fact that it's a
surrogate pair and encoding each Java char as it's own code point...
this is often called modified-UTF8 or CESU-8.

So... say you have this incorrect CESU-8 that is masquerading as
UTF-8: all is not lost.
- A decoder can unambiguously recognize that the characters decoded
actually form a surrogate pair and correctly decode - but I don't know
if there are aby requirements to do so (doubt it), and I don't know
which do so.
- A decoder decoding into UTF-16 (like Java Strings) will correctly
decode anyway, even if treating each code point in the pair as
separate.

PHP5 doesn't even do unicode, so it won't care.
PHP6 apparently has unicode support - don't know much about it.

Bottom line - if we correctly encapsulate whatever the servlet
container is writing, it's certainly possible for clients to use the
output correctly.

> you've also changed the behavior so that even when
> false==modifiedUTF8, the length is now computed differently then before
> the patch (using UnicodeUtil.UTF16toUTF8) ... it's this second change i
> don't understand based on the context of the bug: why does the length
> value need to be computed differnetly for non-jetty implemntations?

It will be the same length for non-jetty implementations - I just
rewrote the entire method and used the Lucene UTF16toUTF8 for
performance reasons.  (bad developer, bad!)

-Yonik
http://www.lucidimagination.com

Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Sep 11, 2009 at 8:37 PM, Chris Hostetter
<ho...@fucit.org> wrote:
> Isn't that an argument in favor of having an explicit option to control
> how we do the counting? otherwise we're still at risk of the scenerio i
> discribed (ie: jetty fixes the byte conversion code, but we're still
> counting the bytes "wrong" for them)

I tried testing w/ Jetty7 RC5 but didn't get too far (couldn't bring it up).

I've attached an additional patch to SOLR-1091 that adds a boolean
"CESU8" system property.

-Yonik
http://www.lucidimagination.com

Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Chris Hostetter <ho...@fucit.org>.
: > if the container can't correctly output
: > some characters, i see no reason to hide the bug
: 
: Another problem is that it won't reliably break.  The bug breaks our
: encapsulation (before the patch) and thus the client reads the wrong
: number of chars for the string, and who knows what happens after that.
:  The majority of the time will result in an exception, but it really
: depends.  This is the type of stuff (buffer underflows / overflows)
: that could be used to mess with security too... a carefully crafted
: request could inject / change fields in the response and have it look
: valid.

Isn't that an argument in favor of having an explicit option to control 
how we do the counting? otherwise we're still at risk of the scenerio i 
discribed (ie: jetty fixes the byte conversion code, but we're still 
counting the bytes "wrong" for them)

with an explicit option we put control in the hands of the solr admin to 
decide what's right for them (we can even give them a little shell script 
to test it with)


-Hoss


Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Sep 8, 2009 at 7:46 PM, Chris Hostetter
<ho...@fucit.org> wrote:
> if the container can't correctly output
> some characters, i see no reason to hide the bug

Another problem is that it won't reliably break.  The bug breaks our
encapsulation (before the patch) and thus the client reads the wrong
number of chars for the string, and who knows what happens after that.
 The majority of the time will result in an exception, but it really
depends.  This is the type of stuff (buffer underflows / overflows)
that could be used to mess with security too... a carefully crafted
request could inject / change fields in the response and have it look
valid.

-Yonik
http://www.lucidimagination.com

Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Chris Hostetter <ho...@fucit.org>.
: > : +  static boolean modifiedUTF8 = System.getProperty("jetty.home") != null;
: >
: > ...that seems really hackish to me, particularly since for all we know
: > there are other servlet containers that might have the same problem.
: 
: Yeah, it is.
: But it's not really a valid option, it's a bug/limitation in the
: servlet container IMO.  It would also suck to bloat configuration (and
: users brains) with options that don't really control anything, except
: that they must correctly match it up with how their servlet container
: behaves.  And this doesn't actually fix everything - it simply makes
: it such that encapsulation at the transport layer isn't broken - the
: end user will still be getting back incorrect UTF8.

my suggestion for adding an explicit option for the modifiedUTF8 behavior 
to the phps response writer was on the (miss)understanding that the value 
was always correct, it was just the length calculation that was being done 
wrong by jetty (and correct by other implementations) so if we noticed we 
were running in jetty we'd replace the length calculation with our own 
(which might be less efficient).

but i guess i don't really understand the patch ... actually, the more i 
look at it the less i understand it....

The modifiedUTF8 boolean only influence the numeric length returned as the 
"s" option ... the actaully "val" string is still written "as is" by the 
servlet container.  you've also changed the behavior so that even when 
false==modifiedUTF8, the length is now computed differently then before 
the patch (using UnicodeUtil.UTF16toUTF8) ... it's this second change i 
don't understand based on the context of the bug: why does the length 
value need to be computed differnetly for non-jetty implemntations?

: containers hands and do it all ourselves.  Or just don't support any
: servlet containers that can't handle code points outside the BMP?  Or

I'm in favor of this approach .. if the container can't correctly output 
some characters, i see no reason to hide the bug -- if somebody else wants 
to code arround the bug by doing it all in solr that's fine, but untill 
then i don't see an advantage in outputing the correct number of bytes for 
a garbage string -- anybody who really cares is going to need to switch to 
a differnet servlet container anyway.

: is there simply a Jetty config option we've been missing.  It's hard
: to believe that such a popular servlet container can't handle this.

hey, i don't even understand the bug enough to know what to search for to 
try and find an option.



-Hoss

Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.java

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Sep 3, 2009 at 8:24 PM, Chris Hostetter<ho...@fucit.org> wrote:
>
> : +61. SOLR-1091: Jetty's use of CESU-8 for code points outside the BMP
> : +    resulted in invalid output from the serialized PHP writer. (yonik)
>
>        ...
>
> : +  static boolean modifiedUTF8 = System.getProperty("jetty.home") != null;
>
> ...that seems really hackish to me, particularly since for all we know
> there are other servlet containers that might have the same problem.

Yeah, it is.
But it's not really a valid option, it's a bug/limitation in the
servlet container IMO.  It would also suck to bloat configuration (and
users brains) with options that don't really control anything, except
that they must correctly match it up with how their servlet container
behaves.  And this doesn't actually fix everything - it simply makes
it such that encapsulation at the transport layer isn't broken - the
end user will still be getting back incorrect UTF8.

I guess one better fix is to take the UTF8 encoding out of the servlet
containers hands and do it all ourselves.  Or just don't support any
servlet containers that can't handle code points outside the BMP?  Or
is there simply a Jetty config option we've been missing.  It's hard
to believe that such a popular servlet container can't handle this.

-Yonik
http://www.lucidimagination.com