Posted to common-dev@hadoop.apache.org by "Michel Tourn (JIRA)" <ji...@apache.org> on 2006/06/14 02:32:29 UTC

[jira] Created: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

class Text (replacement for class UTF8) was: HADOOP-136
-------------------------------------------------------

         Key: HADOOP-302
         URL: http://issues.apache.org/jira/browse/HADOOP-302
     Project: Hadoop
        Type: Improvement

  Components: io  
    Reporter: Michel Tourn
 Assigned to: Doug Cutting 


Just to verify: which length-encoding scheme are we using for class Text (aka LargeUTF8)?

a) The "UTF-8/Lucene" scheme? (highest bit of each byte is an extension bit, which I think is what Doug is describing in his last comment) or 
b) the record-IO scheme in o.a.h.record.Utils.java:readInt 

Either way, note that: 

1. UTF8.java and its successor Text.java need to read the length in two ways: 
  1a. consume 1+ bytes from a DataInput, and 
  1b. parse the length within a byte array at a given offset. 
(1b is used for the "WritableComparator optimized for UTF8 keys".) 

o.a.h.record.Utils only supports the DataInput mode. 
It is not clear to me what the best way is to extend this Utils code when you need to support both reading modes.

2. Methods like UTF8's WritableComparator must be low overhead; in particular, there should be no object allocation. 
For the byte array case, the varlen-reader utility needs to be extended to return both 
the decoded length and the length of the encoded length 
(so that the caller can do offset += encodedLength); see the sketch after this list. 
    
3. A String length never needs (small) negative integers, so the encoding need not support them. 

4. One advantage of a) is that it is standard (or at least well-known and natural) and there are no magic constants (like -120, -121, -124). 
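
For concreteness, here is a minimal sketch of scheme a) supporting both reading modes. The class and method names are illustrative only, not an existing Hadoop API, and it assumes the low-order 7-bit groups come first, as in Lucene's VInt:

import java.io.DataInput;
import java.io.IOException;

public final class VarLenInt {

  // Mode 1a: consume 1+ bytes from a DataInput. The high bit of each
  // byte is the extension bit; '1' means more bytes follow.
  public static int readVInt(DataInput in) throws IOException {
    int value = 0, shift = 0;
    byte b;
    do {
      b = in.readByte();
      value |= (b & 0x7f) << shift;  // low 7 bits carry payload
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  // Mode 1b: parse the length inside a byte array at a given offset.
  public static int readVInt(byte[] bytes, int offset) {
    int value = 0, shift = 0;
    byte b;
    do {
      b = bytes[offset++];
      value |= (b & 0x7f) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  // Size of the encoded form, so a comparator can do
  // offset += encodedLength(...) without allocating a result object.
  public static int encodedLength(byte[] bytes, int offset) {
    int len = 1;
    while ((bytes[offset++] & 0x80) != 0) {
      len++;
    }
    return len;
  }
}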

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12419845 ] 

Doug Cutting commented on HADOOP-302:
-------------------------------------

I think we should use this opportunity to switch to standard UTF-8 for persistent data.  Optimized code should try to avoid conversion of these to Java strings.  For example, comparison can be done on the binary form (since this yields the same results as lexicographic unicode comparisons).
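
A minimal sketch of what such a binary comparison could look like (not the committed comparator; the 0xff masking is what makes Java's signed bytes compare as unsigned, so the byte order agrees with code point order):

public static int compareUtf8(byte[] a, int aOff, int aLen,
                              byte[] b, int bOff, int bLen) {
  int n = Math.min(aLen, bLen);
  for (int i = 0; i < n; i++) {
    int x = a[aOff + i] & 0xff;  // treat the byte as unsigned
    int y = b[bOff + i] & 0xff;
    if (x != y) {
      return x - y;              // first differing byte decides
    }
  }
  return aLen - bLen;            // shared prefix: shorter sorts first
}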



[jira] Updated: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-302?page=all ]

Hairong Kuang updated HADOOP-302:
---------------------------------

    Attachment: text.patch

This patch includes the Text class, which stores a string in the standard UTF-8 format, compares two strings bytewise in UTF-8 order, and provides many other functions.

This patch also includes a JUnit test for the Text class.

Many thanks to Addison Philip for his time on the design discussion and code review. He also contributed quite a lot of code.
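
A hypothetical usage sketch based on the description above (the exact method names are assumptions about the attached patch, not confirmed API):

import org.apache.hadoop.io.Text;

public class TextDemo {
  public static void main(String[] args) {
    Text a = new Text("héllo");          // stored internally as standard UTF-8
    Text b = new Text("world");
    System.out.println(a.compareTo(b));  // bytewise comparison in UTF-8 order
    System.out.println(a.getLength());   // byte length of the UTF-8 encoding: 6
    System.out.println(a.toString());    // decode back to a Java String
  }
}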


[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Bryan Pendleton (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12425388 ] 
            
Bryan Pendleton commented on HADOOP-302:
----------------------------------------

I don't know what the culprits are, but here are two stack traces I got today that killed a 2-hour job. Maybe, for now, validateUTF should be called when first serializing, too. There still seem to be a few bugs in how Text handles content.

java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 3
	at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:152)
	at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:272)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1076)
Caused by: java.nio.charset.MalformedInputException: Input length = 3
	at org.apache.hadoop.io.Text.validateUTF(Text.java:439)
	at org.apache.hadoop.io.Text.validateUTF8(Text.java:419)
	at org.apache.hadoop.io.Text.readFields(Text.java:228)
	at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:82)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:370)
	at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:183)
	at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:149)
	... 3 more

java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 26
	at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:152)
	at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:272)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1076)
Caused by: java.nio.charset.MalformedInputException: Input length = 26
	at org.apache.hadoop.io.Text.validateUTF(Text.java:439)
	at org.apache.hadoop.io.Text.validateUTF8(Text.java:419)
	at org.apache.hadoop.io.Text.readFields(Text.java:228)
	at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:82)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:370)
	at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:183)
	at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:149)
	... 3 more

Unlike last time, this content doesn't contain tabs. Contact me off-list for pointers to the dataset this occurred in.
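
A sketch of the validate-on-serialize idea (it assumes the validateUTF8 helper visible in the stack traces is, or could be made, public; the write path shown is illustrative only):

import java.nio.charset.MalformedInputException;
import org.apache.hadoop.io.Text;

public class SafeWrite {
  // Fail fast at write time instead of hours later in a reducer.
  public static void validateBeforeWrite(Text value) throws MalformedInputException {
    Text.validateUTF8(value.getBytes());  // throws on malformed input
    // ... then hand the value to the normal SequenceFile write path ...
  }
}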


[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12420370 ] 

Hairong Kuang commented on HADOOP-302:
--------------------------------------

Sounds great! I believe that ordering by UTF8 is the right way to go.



[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12420227 ] 

Doug Cutting commented on HADOOP-302:
-------------------------------------

Re String comparison: The bug here is with Java.  Since we wish to keep our persistent data structures language-independent, we should order by UTF-8, not UTF-16.

The javadoc is confusing:

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#compareTo(java.lang.String)

It says it compares Unicode characters, when in fact it compares UTF-16 code units.

So any code that orders by Java String and expects the results to align with the Hadoop Text class will be buggy when processing text that contains surrogate pairs.  We should make this clear in the javadoc.

Does this sound reasonable?
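
A tiny demonstration of the distinction (plain Java, no Hadoop classes):

public class Utf16OrderDemo {
  public static void main(String[] args) {
    String supplementary = "\uD800\uDC00"; // surrogate pair for U+10000
    String privateUse = "\uE000";          // U+E000

    // compareTo works char by char, i.e. on UTF-16 code units:
    // 0xD800 < 0xE000, so this prints a negative number.
    System.out.println(supplementary.compareTo(privateUse));

    // By Unicode code point, however, U+10000 > U+E000: prints true.
    System.out.println(supplementary.codePointAt(0) > privateUse.codePointAt(0));
  }
}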



[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12420149 ] 

Hairong Kuang commented on HADOOP-302:
--------------------------------------

If we use the record-IO scheme, we need to extend it so that it can read a variable-length integer from a byte array. This is needed to support bytewise comparison.



[jira] Resolved: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-302?page=all ]

Doug Cutting resolved HADOOP-302.
---------------------------------

    Fix Version/s: 0.5.0
       Resolution: Fixed

I just committed this.  Thanks!


[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Milind Bhandarkar (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12420152 ] 

Milind Bhandarkar commented on HADOOP-302:
------------------------------------------

The record-IO scheme also supports negative numbers, which is not needed here; dropping that support would let us save a few more bits.



[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12419872 ] 

eric baldeschwieler commented on HADOOP-302:
--------------------------------------------

+1 on Doug's suggestion.  Let's use real UTF8.  Then we can interoperate with more things.

Agreed that we need to use one of the existing variable-length encodings.  Inventing another would be counterproductive.  My preference would be to use the record-IO scheme, since it is already in Hadoop.  If we choose to import the Lucene version, we should consider using it for record-IO too; it is easy to change now, since it is still new.




[jira] Updated: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-302?page=all ]

Hairong Kuang updated HADOOP-302:
---------------------------------

    Attachment: textwrap.patch

Here is the patch that should fix Bryan's problem.


[jira] Updated: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-302?page=all ]

Hairong Kuang updated HADOOP-302:
---------------------------------

    Attachment: VInt.patch

This patch extracts the zero-compressed integer code from the Hadoop record package into Hadoop I/O. It also adds functions for comparing serialized integers bytewise.
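
For reference, a sketch of the zero-compressed idea (the thresholds and marker constants below, -112 and -120, follow the WritableUtils-style layout and are assumptions about this patch, not confirmed values):

import java.io.DataOutput;
import java.io.IOException;

public final class ZeroCompressedInt {
  public static void writeVLong(DataOutput out, long i) throws IOException {
    if (i >= -112 && i <= 127) {     // small values fit in a single byte
      out.writeByte((byte) i);
      return;
    }
    int len = -112;
    if (i < 0) {
      i ^= -1L;                      // one's complement of negative values
      len = -120;                    // marker range for negatives
    }
    long tmp = i;
    while (tmp != 0) {               // count the bytes the value needs
      tmp >>= 8;
      len--;
    }
    out.writeByte((byte) len);       // marker byte encodes sign + byte count
    len = (len < -120) ? -(len + 120) : -(len + 112);
    for (int idx = len; idx != 0; idx--) {
      int shift = (idx - 1) * 8;     // big-endian: high byte first
      out.writeByte((byte) ((i >> shift) & 0xff));
    }
  }
}

Emitting the value big-endian also means two encodings of the same length can be compared byte by byte, which is presumably what the bytewise-comparison helpers build on.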


[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12419800 ] 

Hairong Kuang commented on HADOOP-302:
--------------------------------------

There are two issues with the current implementation of UTF8.

The first is that it does not handle overly long strings: the length of a string is limited to a short, not an int. I'd like to address this problem by storing the length of a string in a variable-length format. The highest bit of each byte is an extension bit: '1' means that more bytes follow, while '0' marks the last byte. A sketch of the writer side is below.
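
A minimal sketch of that encoding's writer (assuming the low-order 7-bit groups are emitted first, as in Lucene's VInt; the matching reader appears in the sketch under the issue description at the top of this thread):

public static void writeVInt(java.io.DataOutput out, int length) throws java.io.IOException {
  // precondition: length >= 0 (string lengths are never negative)
  while ((length & ~0x7f) != 0) {          // more than 7 significant bits remain
    out.writeByte((length & 0x7f) | 0x80); // low 7 bits, extension bit set
    length >>>= 7;
  }
  out.writeByte(length);                   // final byte: extension bit clear
}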

The second is that the class chooses Java's modified UTF8 as the serialized form.  Some argue that we should use the standard UTF8.  Serializing a string to Java's modified UTF8 is quite efficient, but it is Java's internal representation.  If we want to support communication across programming languages, it makes more sense to use the standard UTF8.

Also, for the name of the class, could I use "StringWritable"? It would be consistent with other classes that implement WritableComparable, like IntWritable, FloatWritable, etc. 



[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12420135 ] 

Hairong Kuang commented on HADOOP-302:
--------------------------------------

If we use standard UTF8, comparison on the binary form does not produce the same results as Java String comparison. See the following example provided by Addison:

> Consider the sequence U+D800 U+DC00 (a surrogate pair). In String comparison, this compares as less than U+E000 (since 
> D800 < E000). In UTF-8 byte comparisons it is greater than U+E000 (because the lead byte of the Unicode character 
> U+10000 encoded by the surrogate pair is 0xF0, which is bigger than the lead byte of U+E000, which is 0xEE). 

Is it an issue? 
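
Addison's example can be checked directly (plain Java; the byte values in the comments are the standard UTF-8 encodings):

public class SurrogateOrderCheck {
  public static void main(String[] args) throws Exception {
    String pair = "\uD800\uDC00";  // U+10000, UTF-8: F0 90 80 80
    String e000 = "\uE000";        // U+E000,  UTF-8: EE 80 80

    // Java String (UTF-16) order: D800 < E000, so this prints true
    System.out.println(pair.compareTo(e000) < 0);

    byte[] a = pair.getBytes("UTF-8");
    byte[] b = e000.getBytes("UTF-8");
    // unsigned comparison of the lead bytes: 0xF0 > 0xEE, prints true
    System.out.println((a[0] & 0xff) > (b[0] & 0xff));
  }
}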




[jira] Assigned: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-302?page=all ]

Hairong Kuang reassigned HADOOP-302:
------------------------------------

    Assign To: Hairong Kuang  (was: Doug Cutting)



[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Bryan Pendleton (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12423739 ] 
            
Bryan Pendleton commented on HADOOP-302:
----------------------------------------

Started using this in my code... it looks like there are still some bugs. Here's a test case that shouldn't fail. (I have no idea how UTF-8 works, or I'd try to offer an actual solution.)

Index: hadoop/src/test/org/apache/hadoop/io/TestText.java
===================================================================
--- hadoop/src/test/org/apache/hadoop/io/TestText.java	(revision 425795)
+++ hadoop/src/test/org/apache/hadoop/io/TestText.java	(working copy)
@@ -87,7 +87,11 @@
 
 
   public void testCoding() throws Exception {
-    for (int i = 0; i < NUM_ITERATIONS; i++) {
+	  String badString = "Bad \t encoding \t testcase.";
+	  Text testCase = new Text(badString);
+	  assertTrue(badString.equals(testCase.toString()));
+
+	  for (int i = 0; i < NUM_ITERATIONS; i++) {
       try {
           // generate a random string
           String before;



[jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136

Posted by "Bryan Pendleton (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12423740 ] 
            
Bryan Pendleton commented on HADOOP-302:
----------------------------------------

Whoops, I forgot that inline patches are bad form. The point is, the encoding of "Bad \t encoding \t testcase." fails the validateUTF check. It seems to have something to do with the double tabs.
