Posted to common-dev@hadoop.apache.org by "Michel Tourn (JIRA)" <ji...@apache.org> on 2006/06/03 02:51:30 UTC

[jira] Commented: (HADOOP-136) Overlong UTF8's not handled well

    [ http://issues.apache.org/jira/browse/HADOOP-136?page=comments#action_12414546 ] 

Michel Tourn commented on HADOOP-136:
-------------------------------------

Just to verify: which length-encoding scheme are we using for class Text (aka LargeUTF8)?

a) the "UTF-8/Lucene" scheme (the highest bit of each byte is an extension bit, which I think is what Doug is describing in his last comment), or 
b) the record-IO scheme in o.a.h.record.Utils.java:readInt?
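A minimal sketch of what scheme (a) looks like on the wire, assuming a Lucene-style continuation bit (class and method names here are illustrative, not the actual Text/UTF8 code):

```java
// Hypothetical sketch of scheme (a): each byte carries 7 payload bits,
// least-significant group first; the high bit marks "more bytes follow".
public class VIntEncode {
    static byte[] encode(int value) {
        byte[] buf = new byte[5];                      // 5 bytes cover any 32-bit int
        int n = 0;
        while ((value & ~0x7F) != 0) {
            buf[n++] = (byte) ((value & 0x7F) | 0x80); // set continuation bit
            value >>>= 7;
        }
        buf[n++] = (byte) value;                       // final byte: high bit clear
        return java.util.Arrays.copyOf(buf, n);
    }
}
```

Under this scheme, for example, 300 encodes as the two bytes 0xAC 0x02, and any value below 128 fits in one byte.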

Either way, note that:

1. UTF8.java and its successor Text.java need to read the length in two ways: 
  1a. consume 1+ bytes from a DataInput, and 
  1b. parse the length within a byte array at a given offset 
(1b is used for the "WritableComparator optimized for UTF8 keys").  

o.a.h.record.Utils only supports the DataInput mode.
It is not clear to me what the best way is to extend this Utils code when you need to support both reading modes.
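One way to support both reading modes is simply to carry the same small decode loop in two overloads, one per source; a minimal sketch, assuming the continuation-bit encoding of scheme (a) (names are illustrative, not the actual o.a.h.record.Utils API):

```java
import java.io.DataInput;
import java.io.IOException;

// Hypothetical sketch: the same decode loop in two overloads, covering
// mode 1a (DataInput) and mode 1b (byte array at an offset).
public class DualReader {
    // 1a: consume 1+ bytes from a DataInput.
    static int readVInt(DataInput in) throws IOException {
        int value = 0, shift = 0, b;
        do {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);   // continuation bit set: keep reading
        return value;
    }

    // 1b: parse within a byte array at a given offset, without wrapping
    // the array in a stream (no Object allocation on the comparator path).
    static int readVInt(byte[] buf, int offset) {
        int value = 0, shift = 0, b;
        do {
            b = buf[offset++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```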

2. Methods like UTF8's WritableComparator are meant to be low-overhead; in particular, there should be no Object allocation. 
For the byte-array case, the varlen-reader utility needs to be extended to return both 
 the decoded length and the length of the encoded length 
 (so that the caller can do offset += encodedLength).
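One allocation-free way to return both values is to pack them into a single long; a sketch under that assumption (the packing convention is mine, not anything in Hadoop):

```java
// Hypothetical sketch of point 2: return the decoded length and the number
// of bytes its encoding occupied, packed into one long so the comparator's
// hot path allocates no result object.
public class LenPair {
    // High 32 bits: decoded value; low 32 bits: encoded byte count.
    static long readVIntWithSize(byte[] buf, int offset) {
        int start = offset, value = 0, shift = 0, b;
        do {
            b = buf[offset++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return ((long) value << 32) | (offset - start);
    }
}
```

The caller then unpacks with `len = (int) (packed >>> 32)` and advances with `offset += (int) packed`.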
   
3. A String length does not need (small) negative integers.

4. One advantage of a) is that it is standard (or at least well-known and natural) and there are no magic constants (like -120, -121, -124).



> Overlong UTF8's not handled well
> --------------------------------
>
>          Key: HADOOP-136
>          URL: http://issues.apache.org/jira/browse/HADOOP-136
>      Project: Hadoop
>         Type: Bug
>   Components: io
>     Versions: 0.2
>     Reporter: Dick King
>     Assignee: Michel Tourn
>     Priority: Minor
>      Fix For: 0.4
>  Attachments: largeutf8.patch
>
> When we feed an overlong string to the UTF8 constructor, three suboptimal things happen.
> First, we truncate to 0xffff/3 characters on the assumption that every character takes three bytes in UTF8.  This can truncate strings that don't need it, and it can be overoptimistic since there are characters that render as four bytes in UTF8.
> Second, the code doesn't actually handle four-byte characters.
> Third, there's a behavioral discontinuity.  If the string is "discovered" to be overlong by the arbitrary limit described above, we truncate with a log message, otherwise we signal a RuntimeException.  One feels that both forms of truncation should be treated alike.  However, this issue is concealed by the second issue; the exception will never be thrown because UTF8.utf8Length can't return more than three times the length of its input.
> I would recommend changing UTF8.utf8Length to let its caller know how many characters of the input string will actually fit if there's an overflow [perhaps by returning the negative of that number] and doing the truncation accurately as needed.
> -dk
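The fix the reporter suggests above could be sketched as follows (an illustrative sketch only, not the contents of largeutf8.patch; the signature and the negative-return convention are assumptions drawn from the description):

```java
// Hypothetical sketch: count UTF-8 bytes per code point, including 4-byte
// supplementary characters, and on overflow return the negative of the
// number of chars of the input that actually fit in maxBytes.
public class Utf8Length {
    static int utf8Length(String s, int maxBytes) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int width = Character.charCount(cp);  // 2 chars for a surrogate pair
            int need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + need > maxBytes) return -i;  // only i chars fit
            bytes += need;
            i += width;
        }
        return bytes;  // whole string fits: its UTF-8 byte length
    }
}
```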

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira