Posted to common-issues@hadoop.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2010/07/20 16:21:50 UTC
[jira] Created: (HADOOP-6868) Text class should provide method to
return byte array of getLength()
Text class should provide method to return byte array of getLength()
--------------------------------------------------------------------
Key: HADOOP-6868
URL: https://issues.apache.org/jira/browse/HADOOP-6868
Project: Hadoop Common
Issue Type: Bug
Components: util
Affects Versions: 0.20.2
Reporter: Ted Yu
People often use the following code to convert a Text to a String:
String valueString = new String(valueText.getBytes(), "UTF-8");
However, if the Text object is reused, the above call returns a String whose length never decreases, because getBytes() returns the whole backing array rather than only the valid bytes.
From the 'Hadoop and XML' discussion thread:
The problem I am seeing is that, between the Map phase and the
Reduce phase, the XML is getting munged. For example:
</PrivateRate>
</PrivateRateSet>te>
Text should provide a method that returns a byte array of exactly getLength() bytes.
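To make the munged output above concrete, here is a minimal sketch that reproduces it without Hadoop on the classpath. ReusedBufferDemo is a hypothetical stand-in for a reused Text object, not the real class: it only assumes a backing array that grows but never shrinks, plus a separate valid length. The six leading spaces in the first value are also an assumption, chosen so that the stale tail comes out as "te>".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a reused Text object (NOT the Hadoop class):
// a backing array that only grows, plus a separate count of valid bytes.
final class ReusedBufferDemo {
    private byte[] bytes = new byte[0];
    private int length = 0;

    void set(String s) {
        byte[] src = s.getBytes(StandardCharsets.UTF_8);
        if (src.length > bytes.length) {
            bytes = new byte[src.length]; // grow only, never shrink
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;              // stale bytes past this index remain
    }

    byte[] getBytes() { return bytes; }

    int getLength() { return length; }

    // The problematic conversion from the report: decodes the whole array.
    static String naiveDecode(ReusedBufferDemo b) {
        return new String(b.getBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusedBufferDemo value = new ReusedBufferDemo();
        value.set("      </PrivateRate>"); // 20 bytes
        value.set("</PrivateRateSet>");    // 17 bytes, buffer stays at 20
        System.out.println(naiveDecode(value)); // prints </PrivateRateSet>te>
    }
}
```

The last three bytes of the earlier, longer value survive in the buffer, which is exactly the trailing "te>" seen in the thread.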
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-6868) Text class should provide method to
return byte array of getLength()
Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890305#action_12890305 ]
Ted Yu commented on HADOOP-6868:
--------------------------------
So the correct call in the above use case is:
String valueString = new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8");
The Text class Javadoc should explicitly document this use case.
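A small sketch of the bounded conversion, using a plain byte array in place of a Text so that it runs without Hadoop. The class name BoundedDecode and the helper decode are made up for illustration, and StandardCharsets.UTF_8 is used instead of the "UTF-8" charset name only to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): decode only the valid prefix
// of a buffer that may be larger than the data it currently holds.
final class BoundedDecode {
    static String decode(byte[] buffer, int validLength) {
        // Offset 0, validLength bytes: anything past validLength is ignored.
        return new String(buffer, 0, validLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 20-byte buffer whose last 3 bytes are left over from an older value.
        byte[] buffer = "</PrivateRateSet>te>".getBytes(StandardCharsets.UTF_8);
        int validLength = 17;
        System.out.println(decode(buffer, validLength)); // prints </PrivateRateSet>
    }
}
```

With a real Text, buffer would be valueText.getBytes() and validLength would be valueText.getLength().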
[jira] Commented: (HADOOP-6868) Text class should provide method to
return byte array of getLength()
Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890374#action_12890374 ]
Scott Carey commented on HADOOP-6868:
-------------------------------------
I don't think that is a bug. Passing around byte arrays larger than the valid data is common practice in Java for performance reasons, hence the common method signature containing (byte[] bytes, int offset, int len) and similar. Creating a new byte array for each resize defeats the purpose of re-using the byte array and the Text object: lower memory allocation and improved CPU cache locality. The byte array here is a buffer; it does not represent the entire string.
Instead, use the helper method toString().
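For readers less familiar with the (buffer, offset, length) convention, here is a rough sketch using only standard java.io classes. The class name OffsetLengthConvention and the method drain are invented for this example; the read and write calls are the regular InputStream and ByteArrayOutputStream methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative only: one small buffer is reused for every read, and each
// call states how many of its bytes are valid instead of allocating anew.
final class OffsetLengthConvention {
    static byte[] drain(InputStream in) {
        byte[] buffer = new byte[8]; // deliberately small, reused each pass
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            int n;
            while ((n = in.read(buffer, 0, buffer.length)) != -1) {
                out.write(buffer, 0, n); // only the first n bytes are valid
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = "hello, reused buffer".getBytes(StandardCharsets.UTF_8);
        byte[] copy = drain(new ByteArrayInputStream(input));
        System.out.println(new String(copy, StandardCharsets.UTF_8));
    }
}
```

Writing out.write(buffer) instead of out.write(buffer, 0, n) would make the same mistake as decoding the whole array returned by getBytes().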
[jira] Commented: (HADOOP-6868) Text class should provide method to
return byte array of getLength()
Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890312#action_12890312 ]
Ted Yu commented on HADOOP-6868:
--------------------------------
From Peter Minearo:
Let's say you create a Text object and drop in a String that sets the byte array length to 200. Then you drop in a second String that sets the byte array length to 500. Since the new length is greater than the previous length, the byte array is replaced with one of the longer length. Now, if you drop in a third String that would set the byte array length to 350, the Text object does not replace the byte array with a new one of length 350; it keeps the longer array of 500 and uses an extra variable to track the "real" length.
So: Text.getBytes().length != Text.getLength()
This does 2 things:
1. Passes around more data than is needed
2. Makes the Text object confusing to work with
The correct behavior should be: Text.getBytes().length == Text.getLength()
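The 200 / 500 / 350 sequence above can be sketched as follows. CapacityVersusLength is a hypothetical model of the behavior described in this comment, not the Hadoop implementation, and exactCopy() is an invented helper showing what an exactly-sized array would cost: one extra allocation and copy per call.

```java
import java.util.Arrays;

// Hypothetical model of the behavior described above (NOT the Hadoop class).
final class CapacityVersusLength {
    private byte[] bytes = new byte[0];
    private int length = 0;

    void set(byte[] src) {
        if (src.length > bytes.length) {
            bytes = new byte[src.length]; // replaced only when too small
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;              // the "real" length
    }

    int capacity() { return bytes.length; }

    int getLength() { return length; }

    // Exactly-sized copy of the valid bytes; allocates a new array each call.
    byte[] exactCopy() { return Arrays.copyOf(bytes, length); }

    public static void main(String[] args) {
        CapacityVersusLength t = new CapacityVersusLength();
        for (int size : new int[] {200, 500, 350}) {
            t.set(new byte[size]);
            System.out.println("capacity=" + t.capacity() + " length=" + t.getLength());
        }
        // After the third set: capacity=500 length=350
        System.out.println("exactCopy length=" + t.exactCopy().length);
    }
}
```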
[jira] Resolved: (HADOOP-6868) Text class should provide method to
return byte array of getLength()
Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Yu resolved HADOOP-6868.
----------------------------
Resolution: Not A Problem