Posted to common-dev@hadoop.apache.org by "Michel Tourn (JIRA)" <ji...@apache.org> on 2006/05/06 04:30:48 UTC

[jira] Commented: (HADOOP-136) Overlong UTF8's not handled well

    [ http://issues.apache.org/jira/browse/HADOOP-136?page=comments#action_12378179 ] 

Michel Tourn commented on HADOOP-136:
-------------------------------------

I need a fix for this. (Serialization of long UTF8 strings)

I have two proposals.
I am not addressing 4-byte UTF8 characters.
What would others recommend here?


Option 1. 

An alternate encoding for potentially long Strings. 
Code must explicitly choose to write and read back the "large" version.

To share as much code as possible, 
just add an extra argument (via a constructor overload) to UTF8: boolean large
If large mode:
 the length encoding is a VarLenShort (see below)
else:
 the length encoding is a short (the current format)
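
A minimal sketch of how Option 1 might look. The field name, the
constructor overload, and the writeVarLenShort helper are assumptions
for illustration, not existing o.a.h.io.UTF8 code:

  private boolean large;                      // false = current wire format

  public UTF8(String string) {
    this(string, false);
  }

  public UTF8(String string, boolean large) {
    this.large = large;
    set(string);
  }

  public void write(DataOutput out) throws IOException {
    if (large) {
      writeVarLenShort(out, length);          // see Option 2 for the encoding
    } else {
      out.writeShort(length);                 // current fixed 16-bit length
    }
    out.write(bytes, 0, length);
  }
  // readFields(DataInput) would branch on the flag the same way.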

Note about the static methods in class o.a.h.io.UTF8:
This change requires instance state (for the boolean large flag),
so the static versions of the UTF8 methods would ignore this change.
This should not be a problem, since the code notes that
the static methods are deprecated and will go away.
   
   
   
Option 2. A semi-backward-compatible change.

In fact this is the same change as Option 1, 
except that we always assume large = true.

in UTF8 change this:
  int bytes = in.readUnsignedShort();
to this:
  int bytes = in.readVarLenShort(); 
(and similarly for writes)

This is word-level variable-length encoding:
if the highest bit of the length word (16th bit) is set, 
then there is an extension word for the length. 
Total payload: 30 bits worth of length (15 usable bits per word), which is enough.

For short enough Strings, the length encoding is unchanged. 
This is why it is semi-backwards-compatible.
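
A minimal sketch of the proposed helpers. readVarLenShort and
writeVarLenShort do not exist on DataInput/DataOutput; they would be
private utilities in UTF8. This assumes the extension word also
reserves its high bit, which is what gives 15 + 15 = 30 bits:

  private static void writeVarLenShort(DataOutput out, int len)
      throws IOException {
    if (len < 0x8000) {
      out.writeShort(len);                    // fits in 15 bits: old format
    } else {
      out.writeShort(0x8000 | (len >>> 15));  // flag bit + high 15 bits
      out.writeShort(len & 0x7fff);           // extension word: low 15 bits
    }
  }

  private static int readVarLenShort(DataInput in) throws IOException {
    int first = in.readUnsignedShort();
    if ((first & 0x8000) == 0) {
      return first;                           // old one-word format
    }
    int ext = in.readUnsignedShort();
    return ((first & 0x7fff) << 15) | (ext & 0x7fff);
  }

Any length <= 32767 round-trips byte-for-byte identically to the
current format, which is exactly the semi-backwards-compatibility.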

What inputs are currently accepted:
 Unicode strings, clipped at 0xffff/3=21845 characters.

What would be backwards compatible:
 Strings of encoded length <= 32767 bytes (see the quick check below).
 This includes: 
  o content with an average character length below 32767/21845 = 1.5 bytes
  o in particular, all single-byte UTF-8 (ASCII, ISO-8859)
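
A quick check of that boundary, runnable on a modern JDK (the class
name and example strings are hypothetical; for these characters,
standard UTF-8 matches the modified UTF-8 that o.a.h.io.UTF8 writes):

  import java.nio.charset.StandardCharsets;

  public class Boundary {
    public static void main(String[] args) {
      // 21845 ASCII chars -> 21845 bytes <= 32767: old format suffices.
      System.out.println("a".repeat(21845)
          .getBytes(StandardCharsets.UTF_8).length);       // 21845
      // 21845 two-byte chars (U+00E9) -> 43690 bytes: extension word needed.
      System.out.println("\u00e9".repeat(21845)
          .getBytes(StandardCharsets.UTF_8).length);       // 43690
    }
  }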
 


> Overlong UTF8's not handled well
> --------------------------------
>
>          Key: HADOOP-136
>          URL: http://issues.apache.org/jira/browse/HADOOP-136
>      Project: Hadoop
>         Type: Bug

>   Components: io
>     Reporter: Dick King
>     Priority: Minor

>
> When we feed an overlong string to the UTF8 constructor, three suboptimal things happen.
> First, we truncate to 0xffff/3 characters on the assumption that every character takes three bytes in UTF8.  This can truncate strings that don't need it, and it can be overoptimistic since there are characters that render as four bytes in UTF8.
> Second, the code doesn't actually handle four-byte characters.
> Third, there's a behavioral discontinuity.  If the string is "discovered" to be overlong by the arbitrary limit described above, we truncate with a log message, otherwise we signal a RuntimeException.  One feels that both forms of truncation should be treated alike.  However, this issue is concealed by the second issue; the exception will never be thrown because UTF8.utf8Length can't return more than three times the length of its input.
> I would recommend changing UTF8.utf8Length to let its caller know how many characters of the input string will actually fit if there's an overflow [perhaps by returning the negative of that number] and doing the truncation accurately as needed.
> -dk
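
For reference, a hedged sketch of the utf8Length change suggested
above. The limit parameter and the negative-return convention follow
the suggestion; the code itself is an illustration, not the actual patch:

  // Returns the encoded length in bytes, or -(number of leading chars that
  // fit) when the encoding would exceed 'limit'.  Uses the same modified
  // UTF-8 rules as DataOutput.writeUTF (NUL takes 2 bytes) and, like the
  // current code, ignores 4-byte supplementary forms.
  // Caveat: -0 == 0, so "nothing fits" is indistinguishable from an empty
  // string; a real patch would have to resolve that.
  private static int utf8Length(String string, int limit) {
    int bytes = 0;
    for (int i = 0; i < string.length(); i++) {
      int c = string.charAt(i);
      int step = (c >= 0x0001 && c <= 0x007f) ? 1 : (c <= 0x07ff) ? 2 : 3;
      if (bytes + step > limit) {
        return -i;                            // only the first i chars fit
      }
      bytes += step;
    }
    return bytes;
  }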
