You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Bryan Pendleton (JIRA)" <ji...@apache.org> on 2006/09/19 23:29:22 UTC

[jira] Created: (HADOOP-550) Text constructure can throw exception

Text constructure can throw exception
-------------------------------------

Key: HADOOP-550
URL: http://issues.apache.org/jira/browse/HADOOP-550
Project: Hadoop
Issue Type: Bug
Reporter: Bryan Pendleton

I finally got back around to moving my working code to using Text objects.

And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.

However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.

I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_ throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).

Please, don't remove UTF8, for now.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Assigned: (HADOOP-550) Text constructure can throw exception

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-550?page=all ]

Hairong Kuang reassigned HADOOP-550:
------------------------------------

    Assignee: Hairong Kuang

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>         Assigned To: Hairong Kuang
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436362 ] 
            
Doug Cutting commented on HADOOP-550:
-------------------------------------

So you're claiming that 'new String(myBytes, "UTF-8") doesn't throw an exception, but that 'new Text(myBytes)' does?

That would imply that String's CharsetDecoder specifies something besides CodingErrorAction.REPORT as its onMalformedInput and onUnmappableCharacter value.

http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/CharsetDecoder.html#onMalformedInput(java.nio.charset.CodingErrorAction)

A quick glance at the sources confirms this.

I would support changing the behaviour of Text to use CharsetDecoder.REPLACE for both malformed input and unmappable characters.  That would also mean we could turn off verification.

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Bryan Pendleton (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436368 ] 
            
Bryan Pendleton commented on HADOOP-550:
----------------------------------------

Ah, the bowels of String handling I hadn't uncovered yet....

Yes, I would support that, at least as the default. Perhaps an alternate constructor could be used that still does MalformedInput detection, if the developer prefers that. In my case, and in the case of strings passing through internal hadoop interfaces, I'd rather not see malformed string content causing (typically) pointless errors.

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Addison Phillips (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12437958 ] 
            
Addison Phillips commented on HADOOP-550:
-----------------------------------------

If you want to have *text*, then you need to know the encoding and have some assurance that it is correct. A text buffer that contains random binary data isn't very useful: you can't do any useful *text* processing on it. The String class's behavior was modified post 1.4 so that instead of silently emiting a null string (caused by the buried CharacterCodingException), it instead replaces bad sequences with U+FFFD characters. The String class is a bit lenient about this: it allows non-shortest form UTF-8 (that is, 0xC0 0x80 == U+0000 aka 'NULL'), while Text's validation routine does not permit this (it's a security flaw to process non-shortest form UTF-8). But it doesn't return the original bytes if the input buffer was bad. 

Either way, I think that Text should emulate this behavior and do replacements, although I note that Text objects constructed with buffers that use an encoding other than UTF-8 will just silently do unexpected or bad things (it doesn't matter if you use the new Text class or the old Utf8 class, it happens either way).

Using the ByteBuffer version of the validation method will help implement this.

Users may not be happy to have their binary data buffers being "modified" by the Text class. But I'd maintain that their original records are *not* text records if they contain damaged data. A lot of "mostly-ASCII" buffers are really in Latin-1, but work okay as UTF-8 until you encounter a non-ASCII character. The Text class, as a wrapper around a Unicode text buffer, can identify these cases (where the user has misidentified the encoding). This is usually a bug somewhere else (your data was writting using a default OutputStreamWriter rather than one with UTF-8, for example). Something is wrong: the class should not perform questionable operations on the data. I could warn the programmer (Exception) or do something to prevent relatively worse results (replace silently).  

If what you really want is not a "text buffer" but just a byte[] or bit-bucket, don't use a Text object for it. That isn't what it is for. If you have a buffer that produces errors, you probably need to provide an encoding to convert the buffer or debug why the buffer contains non-UTF-8 in the first place.

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-550) Text constructure can throw exception

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-550?page=all ]

Mahadev konar updated HADOOP-550:
---------------------------------

    Attachment: text_junit.patch

I have added a junit test and also made the set method throw runtimeexception. The junit test takes a static byte array of non utf8 characters and creates a text object and then get the bytes back using text.getBytes() and compares these bytes. Since we are not replcaing non UTF8 characters then I just compare to see if the initial and the final byte array are the same. 

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>            Reporter: Bryan Pendleton
>         Assigned To: Hairong Kuang
>             Fix For: 0.7.0
>
>         Attachments: text.patch, text_junit.patch
>
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-550) Text constructure can throw exception

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-550?page=all ]

Hairong Kuang updated HADOOP-550:
---------------------------------

    Attachment: text.patch

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>            Reporter: Bryan Pendleton
>         Assigned To: Hairong Kuang
>             Fix For: 0.7.0
>
>         Attachments: text.patch
>
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-550) Text constructure can throw exception

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-550?page=all ]

Hairong Kuang updated HADOOP-550:
---------------------------------

               Status: Patch Available  (was: Open)
        Fix Version/s: 0.7.0
    Affects Version/s: 0.6.2

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>            Reporter: Bryan Pendleton
>         Assigned To: Hairong Kuang
>             Fix For: 0.7.0
>
>         Attachments: text.patch
>
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Addison Phillips (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436728 ] 
            
Addison Phillips commented on HADOOP-550:
-----------------------------------------

I had a hand in advising about this code (and wrote some of it). I agree with Doug Cutting that the current implementation is rather too strict. The CodingErrorAction of REPLACE is probably preferable as a default action.

One important difference between Text's implementation of UTF-8 and Java's String class: is: Text validates for non-shortest form UTF-8, while String does not. Non-shortest form UTF-8 is sometimes exploited as a security flaw (and is NOT valid UTF-8, by rule) and the validate() method inside Text prevents non-shortest sequences in Text objects.

I'd also note that non-UTF-8 data is pretty common, not usually because of unpaired surrogates, but rather because of bad encoding identification or because of mixed UTF-8 and binary data. Blowing chunks on that data is not a good choice as the default.

Providing for validation and tailorable reporting, a la NIO, actually would be the best course of action. If the buffer isn't really UTF-8 and turns out to be an entirely different encoding (probably the most common encoding problem), sometimes you might want to catch it as an exception, but most often you'll probably be fine plunging ahead with U+FFFD (one per bad byte). For the reason cited, I'd use the stricter Unicode rules for non-shortest UTF-8, but certainly think that  just throwing an exception is too strict.



> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436412 ] 
            
Hairong Kuang commented on HADOOP-550:
--------------------------------------

In java, supplementary characters, i.e., codepoints that are greater than U+FFFF, are represented by a pair of char values, called surrogates. If  a Text object is constructed from a String containing unpaired surrogates, CharacterCodingException is thrown.

I agree with Sameer that a Text object should contain valid UTF-8. So by default, probably we should replace illegal bytes with "\uFFFD" as Java does instead of throwing an exception. Then everything should work.

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12438547 ] 
            
Doug Cutting commented on HADOOP-550:
-------------------------------------

Two minor nits:

1. Instead of ignoring the CharacterCodingException that should never be thrown, it would be better to throw a RuntimeException.  That way, if it ever does happen, we'll know.

2. It would be good to add unit tests that creates a Text using invalid UTF-8, i.e., random binary data, and check that various methods work as expected.


> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>            Reporter: Bryan Pendleton
>         Assigned To: Hairong Kuang
>             Fix For: 0.7.0
>
>         Attachments: text.patch
>
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-550) Text constructure can throw exception

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-550?page=all ]

Doug Cutting updated HADOOP-550:
--------------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

I just committed this.  Thanks, Hairong & Mahadev!

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>            Reporter: Bryan Pendleton
>         Assigned To: Hairong Kuang
>             Fix For: 0.7.0
>
>         Attachments: text.patch, text_junit.patch
>
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436413 ] 
            
Doug Cutting commented on HADOOP-550:
-------------------------------------

I think the default, like new String(), should be not to validate, and to silently replace bad data.  If we want to use this as a replacement for String and UTF8, then we should be exception-compatible, and these classes do not validate nor throw exceptions when bytes are converted to Strings.

I think this is a good default.  In my experience, if input is invalid UTF-8 (which is surprisingly common) I would almost always rather process it as best I can than have to handle exceptions or otherwise disable validation.  I would argue that folks who require that invalid UTF-8 throw exceptions are the minority.

So validation and other encoding-related exceptions should be optional.  We can add a flag to the constructor indicating whether it should validate, we can add a config option for TextInputFormat, etc. to enable validation and exceptions for those who desire.


> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12437954 ] 
            
Hairong Kuang commented on HADOOP-550:
--------------------------------------

Currently class Text is the default class for text inputs. Because only valid UTF8 bytes are allowed in Text, if user inputs contain non-UTF8 bytes and Text records are written back to output files, a side effect is that no-UTF8 bytes are replaced or dropped in the output files. Users may not be happy about this.

An alternative way is to allow non-UTF8 bytes in Text by removing the validation in Text constructors. This gives users greater freedom to handle invalid UTF8 bytes in their code if there is any and allows them to write orginal text records back to output files. 

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12438541 ] 
            
Hairong Kuang commented on HADOOP-550:
--------------------------------------

Thanks for your comments, Addison. Currently Text is the default clas for map/reduce text input files, in which records are deliminated by end-of-line and a line may contain binary data. As a temp fix, I'd like to remove validation from class Text constructors. I am thinking the long term plan is to make class Text to be a subclass of BytesWritable. So its intention is to work as a byte buffer but supports toString/fromString assuming UTF8 encoding. Then I'd like to create a new class to act exactly like what Text does now. This will reduce the pain to maintain backward compatibility.

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>         Assigned To: Hairong Kuang
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-550) Text constructure can throw exception

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436403 ] 
            
Sameer Paranjpye commented on HADOOP-550:
-----------------------------------------

I think we should maintain the invariant that a Text object contains valid UTF-8.

Why not add a constructor to Text that accepts a CharsetDecoder. Together with a method 
'void onMalformedInput(CharsetDecoder action)'. The CharsetDecoder is used as a strategy by Text to determine how it should handle illegal values upon construction or assignment.

If CharsetDecoder is not the right type, we can create a type to represent the strategy.

Are unmappable characters a case we need to consider here? You can get an unmappable character converting from one encoding to another, but there is a 1-1 correspondence between UTF-8 and UTF-16 for all valid Unicode code points.

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, its better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so maybe my particluar Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_  throw a content-related Exception. Or, if Text is not the right object because its data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira