Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2012/11/02 04:32:12 UTC

[jira] [Commented] (ACCUMULO-836) Specify Charset on getBytes() call for String objects.

    [ https://issues.apache.org/jira/browse/ACCUMULO-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489228#comment-13489228 ] 

Josh Elser commented on ACCUMULO-836:
-------------------------------------

*GrepIterator*: The javadoc should note that the String being converted to bytes is encoded as UTF-8; otherwise, the class should not make the UTF-8 assertion at all.

*MetadataTable#encode(), DistributedReadWriteLock#getLockData()*: Should note that the byte[] returned from each of these methods contains UTF-8 bytes.

*LongCombiner.StringEncoder, StringMax, StringMin, StringSummation, SummingArrayCombiner.StringArrayEncoder, Authorizations, Master#mergeMetadataRecords*: These classes create UTF-8 bytes, but when the bytes are initially read into a String (typically from a Value), the default encoding is used (the String constructor that takes only a byte array). This leads to inconsistency, because the data could have been read as something other than UTF-8 but then written back out as UTF-8. A decision needs to be made about what to do, and that decision needs to be documented.
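
To make the inconsistency concrete, here is a minimal sketch in plain Java. It is not Accumulo code: the class and method names are hypothetical, and it assumes Java 7+ for StandardCharsets. The read charset is passed in explicitly to stand in for whatever the platform default happens to be. The same stored byte, decoded under two different charsets and re-encoded as UTF-8, comes back as different bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {
    // Hypothetical helper: decode with readCharset (standing in for the
    // platform default used by new String(byte[])), then re-encode as UTF-8.
    static byte[] decodeThenEncodeUtf8(byte[] stored, Charset readCharset) {
        String s = new String(stored, readCharset);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A single byte 0xE9: a valid ISO-8859-1 character, but not a
        // complete UTF-8 sequence on its own.
        byte[] stored = {(byte) 0xE9};

        byte[] viaLatin1 = decodeThenEncodeUtf8(stored, StandardCharsets.ISO_8859_1);
        byte[] viaUtf8 = decodeThenEncodeUtf8(stored, StandardCharsets.UTF_8);

        System.out.println("stored:     " + Arrays.toString(stored));
        System.out.println("via Latin1: " + Arrays.toString(viaLatin1));
        System.out.println("via UTF-8:  " + Arrays.toString(viaUtf8));
        System.out.println("same result: " + Arrays.equals(viaLatin1, viaUtf8));
    }
}
```

In both cases the bytes written back differ from the bytes that were stored, which is why the read side and the write side need to agree on one documented encoding.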

*ZooStore*: Some awkwardness pops out at me in #setProperty(long, String, Serializable) manually adding bytes to the data to be written to ZooKeeper. I don't think UTF-8 will cause any problems, but it could definitely use some clarification.

*TraceServer.Receiver, IndexMeta, AddFilesWithMissingEntries, MetadataTable*: These write out a Value as UTF-8 bytes, but I'm not sure whether there is any case in which a client reading that data would expect something else. Documentation would again be useful. The likelihood of this being an issue is probably small, considering that Hadoop's WritableUtils encodes Strings as UTF-8.

I'm still a little concerned about access points to ZooKeeper and !METADATA, but given that ZooReaderWriter was converting the username and password to UTF-8 bytes, I feel slightly better. I should dig into that code more tomorrow.

One final statement: I still believe that in the ambiguous cases where core classes read arbitrary bytes and write UTF-8 bytes, Accumulo should be agnostic and not make encoding assertions. In other words, I think we should revert those changes and leave it up to the user to decide how they handle their bytes.
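
As a rough illustration of what "agnostic" could look like, the sketch below matches a search term against stored data purely at the byte level, so the caller chooses how the term is encoded. This is not the GrepIterator implementation; the class and method names are hypothetical and only the Java standard library is used.

```java
import java.nio.charset.StandardCharsets;

public class ByteSearch {
    // Returns true if needle occurs as a contiguous run inside haystack.
    // No String is ever built, so no charset is assumed.
    static boolean contains(byte[] haystack, byte[] needle) {
        if (needle.length == 0) {
            return true;
        }
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The caller, not the library, decides how the term becomes bytes.
        byte[] value = "caf\u00e9 au lait".getBytes(StandardCharsets.ISO_8859_1);
        byte[] latin1Term = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8Term = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 term matches: " + contains(value, latin1Term));
        System.out.println("UTF-8 term matches:      " + contains(value, utf8Term));
    }
}
```

A term encoded the same way as the stored data matches, while the same text encoded differently does not; the library itself never has to assert an encoding.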
                
> Specify Charset on getBytes() call for String objects.
> ------------------------------------------------------
>
>                 Key: ACCUMULO-836
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-836
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.5.0
>            Reporter: David Medinets
>            Assignee: David Medinets
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: UTF8.java
>
>
> The comments on ACCUMULO-241 indicate that the build server might have a different default Charset than computers used by developers. Therefore, some of the tests have different results on different computers.
> Every getBytes call on a String object should specify the UTF8 Charset. Unfortunately the codebase has nearly 1,800 getBytes calls.
