Posted to notifications@accumulo.apache.org by "David Medinets (JIRA)" <ji...@apache.org> on 2012/10/31 02:41:12 UTC

[jira] [Created] (ACCUMULO-840) Allow String-based getBytes calls to pick Charset ending from JVM setting.

David Medinets created ACCUMULO-840:
---------------------------------------

             Summary: Allow String-based getBytes calls to pick Charset ending from JVM setting.
                 Key: ACCUMULO-840
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-840
             Project: Accumulo
          Issue Type: Improvement
    Affects Versions: 1.5.0
            Reporter: David Medinets
            Assignee: David Medinets
            Priority: Minor
             Fix For: 1.5.0


ACCUMULO-836 changed all String-based getBytes() calls to use the UTF-8 standard. However, there is a JVM setting called "file.encoding" that should be honored. See http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding for a discussion of JAVA_TOOL_OPTIONS, which might be relevant to this topic. http://javarevisited.blogspot.com/2012/01/get-set-default-character-encoding.html is also a good page to read, especially the comment on how the character encoding is cached.
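
For readers following the thread, here is a minimal sketch of the difference being discussed: encoding with an explicit UTF-8 charset versus encoding with whatever the JVM default happens to be. The class and method names are hypothetical and are not taken from the Accumulo code base. On JVMs of this era the default charset is typically derived from the "file.encoding" property at startup and cached, so the second method can produce different bytes on different machines.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Accumulo.
class DefaultCharsetDemo {

    // Always produces the same bytes, regardless of JVM or OS settings.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Produces bytes in the JVM default charset, which is platform-dependent.
    static byte[] encodeWithDefault(String s) {
        return s.getBytes(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "cafe" with an accented final letter
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 bytes:     " + Arrays.toString(encodeUtf8(s)));
        System.out.println("default bytes:   " + Arrays.toString(encodeWithDefault(s)));
    }
}
```

The two byte arrays printed by main agree only when the default charset is UTF-8 or the input is plain ASCII under an ASCII-compatible default.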

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-840) Allow String-based getBytes calls to pick Charset ending from JVM setting.

Posted by "David Medinets (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487940#comment-13487940 ] 

David Medinets commented on ACCUMULO-840:
-----------------------------------------

From the dev mailing list:

John: Why not just have a configuration in the xml file for setting a global charset? This way we avoid hard-coded settings but also avoid shared-VM issues.

Drew: +1 for a configuration file property -- perhaps this could be worked into the Encoding class


                

[jira] [Commented] (ACCUMULO-840) Allow String-based getBytes calls to pick Charset ending from JVM setting.

Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488024#comment-13488024 ] 

Christopher Tubbs commented on ACCUMULO-840:
--------------------------------------------

There are two issues here. The first is establishing a standard encoding for all Accumulo internal persistent state/metadata, and the second is how to automatically encode API convenience methods that accept String or char[] or CharSequence (from here on, I'll refer to these three collectively as "Strings"). I'll deal with the latter first:

API: It is important to note that Accumulo deals only with bytes. That's it. We don't guarantee a sort order for Strings with arbitrary (or configurable) encoding, though some have asked for custom comparators to achieve fine-grained control over this. Instead, we only guarantee a sort order for bytes, sorted numerically byte-by-byte, from most significant to least. It is important to realize that we only deal with bytes internally, because all of the API decisions appear to be centered around that idea. This is why you almost always see a Text object, because it holds an arbitrary byte array. It is true that Text has a constructor that accepts a String, and it uses a very specific encoding when it does so (UTF8 only, as per its documentation). We have copied this behavior in some of our APIs to add convenience methods that accept Strings, because it's easier than forcing users to write {code:java}new Mutation(new Text("myString".getBytes("UTF8")));{code} It is so much easier to write {code:java}new Mutation("myString");{code} This does not change the behavior, though. We still expect convenience methods that accept Strings to behave as though we had converted the String to UTF8 and passed in the resulting bytes (in a Text object) to the method.
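
To make the "bytes only" point concrete, here is a small self-contained sketch of byte-by-byte ordering over UTF-8 encoded Strings. It assumes each byte is compared as an unsigned value (0 to 255); the comment above does not spell that out, so treat it as an assumption. ByteOrderDemo, compareBytes and utf8 are hypothetical names, not Accumulo or Hadoop API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Accumulo or Hadoop.
class ByteOrderDemo {

    // Lexicographic comparison of two byte arrays.
    // Assumption: each byte is treated as unsigned (0..255).
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return a.length - b.length;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Uppercase "Z" (0x5A) sorts before lowercase "a" (0x61).
        System.out.println(compareBytes(utf8("Z"), utf8("a")) < 0);
        // "z" (0x7A) sorts before U+00E9, whose UTF-8 form starts with 0xC3.
        System.out.println(compareBytes(utf8("z"), utf8("\u00e9")) < 0);
    }
}
```

The resulting order is a property of the encoded bytes, not of the Strings: the same Strings encoded with a different charset could sort differently.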

API (cont.): Now, it may be the case that the API could benefit from convenience wrappers that accept Strings with a specific encoding, or we could change the behavior of those we have to respect the JVM's "file.encoding" property, and simply pre-encode the Strings before we throw their resulting bytes into a Text object. This may be useful and convenient, but this is a VERY LIMITED SCOPE, and it's important to realize that any consideration of changes to the way we encode things should focus on this scope, and not go crazy, changing all instances of "String-based" uses of ".getBytes()" in the code. Regardless of whether we make such changes, though, we should update our Javadocs to ensure that the encoding we use for these convenience methods is described. It is in the case of Mutation... I'm not sure about elsewhere.
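
One way the convenience wrappers mentioned above could look is sketched below: a UTF-8 overload that matches the documented behavior, plus an overload that takes an explicit Charset. RowEncoding and both encode methods are hypothetical; no such class is claimed to exist in Accumulo, and a real wrapper would hand the resulting bytes to a Text object.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the Accumulo API.
class RowEncoding {

    // Documented default: Strings are encoded as UTF-8.
    static byte[] encode(CharSequence row) {
        return encode(row, StandardCharsets.UTF_8);
    }

    // Caller chooses the charset explicitly; nothing is read from JVM settings.
    static byte[] encode(CharSequence row, Charset charset) {
        return row.toString().getBytes(charset);
    }

    public static void main(String[] args) {
        String row = "\u00e9"; // a single accented character
        System.out.println(encode(row).length);                              // UTF-8 length
        System.out.println(encode(row, StandardCharsets.ISO_8859_1).length); // Latin-1 length
    }
}
```

Keeping the charset as a parameter confines the encoding decision to the API boundary, which is the limited scope argued for above.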

INTERNAL: The other scope to consider for encoding has to do with our internal storage (metadata we store in Zookeeper, in the !METADATA table, and other places where Accumulo writes persistent state). It is imperative that we maintain consistency in the way we interpret our persistent state. For this scope, we absolutely should stick to an encoding, but it should be hard-coded (use a Constant or a util method, for convenience), and should not respect any user configurable field. This is important, because a user should be able to change his/her JVM's encoding settings (for the API scope described above) and it should *NOT* affect our ability to read and understand data that we've previously written to Zookeeper or !METADATA (or elsewhere).

INTERNAL (cont.): For the internal, persistent state's encoding, I'm comfortable assuming that we're already treating all persistent String storage as UTF-8 encoded (because we move things around in Text objects a lot, and for those things we aren't, we're probably using ASCII, and can safely treat it as UTF-8). If there are any situations where we are storing persistent state ambiguously, based on anything other than the hard-coded UTF-8 encoding, such that it might cause a problem if a user were to change an OS setting, or non-ASCII data can find its way in, we should treat that as a bug.
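
A minimal sketch of the "hard-coded constant plus util method" idea for internal state follows. InternalEncoding and its members are hypothetical names chosen for illustration. Nothing in it consults "file.encoding" or any other configurable value, so bytes written under one JVM setting decode identically under another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical utility; not the actual Accumulo constant or class.
final class InternalEncoding {

    // Fixed at compile time; deliberately not configurable.
    static final Charset CHARSET = StandardCharsets.UTF_8;

    private InternalEncoding() {
        // Static helpers only.
    }

    // Encode a String for persistent storage (for example, a metadata entry).
    static byte[] encode(String s) {
        return s.getBytes(CHARSET);
    }

    // Decode bytes that were previously written with encode().
    static String decode(byte[] bytes) {
        return new String(bytes, CHARSET);
    }

    public static void main(String[] args) {
        String original = "t\u00e4blet";
        String roundTripped = decode(encode(original));
        System.out.println(original.equals(roundTripped));
    }
}
```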

As far as I see it, these are the only two scopes we need to concern ourselves with when considering encoding.
                