You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@accumulo.apache.org by David Medinets <da...@gmail.com> on 2012/10/28 22:50:23 UTC

Setting Charset in getBytes() call.

https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680

In this comment, John mentioned that all getBytes() method calls
should be changed to use UTF8. There are about 1,800 getBytes() calls
and not all of them involve String objects. I am working on ways to
identify a subset of these calls to change.

I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
track this issue.

Should we create one static Charset object?

  Class AccumuloDefaultCharset {
    public static Charset UTF8 = Charset.forName("UTF8");
  }

Should we use a static constant?

  public static String UTF8 = "UTF8";

I have found one instance of getBytes() in InputFormatBase:

  protected static byte[] getPassword(Configuration conf) {
    return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
  }

Are there any reasons why I can't start specifying the charset? Is
UTF8 the right Charset to use? I am not an expert in non-English
charsets, so guidance would be welcome.

Re: Setting Charset in getBytes() call.

Posted by William Slacum <wi...@accumulo.net>.

Since this only effects Strings, I'm even more inclined to leave the option
at the JVM. Most of our methods that accept a `CharSequence` or `String`
object end up creating a `Text` object based off them, which encodes them
with UTF-8. I'd much rather make it our convention to always convert
`String` to `Text` objects if we need to deal with them in a textual way;
otherwise we're just dealing with `byte[]` when serializing keys and
values.

Now, it's another story if Thrift is serializing `String`s with the JVM
setting...

On Mon, Oct 29, 2012 at 1:00 PM, David Medinets <da...@gmail.com>wrote:

> > David, can you give some sort of feel for the usages of the getBytes()
> > calls? Since most of the API deals with things in terms of Text and
> byte[]
> > (Key and Value decomposed), are most of the usages
> configuration/user-input
> > based as your initial snippet from InputFormatBase showed?
>
> I will post a list of the files that I have changed before I commit. I
> will post the file list as a response in this thread.
>

Re: Setting Charset in getBytes() call.

Posted by David Medinets <da...@gmail.com>.

> David, can you give some sort of feel for the usages of the getBytes()
> calls? Since most of the API deals with things in terms of Text and byte[]
> (Key and Value decomposed), are most of the usages configuration/user-input
> based as your initial snippet from InputFormatBase showed?

I will post a list of the files that I have changed before I commit. I
will post the file list as a response in this thread.

Re: Setting Charset in getBytes() call.

Posted by Ed Kohlwey <ek...@gmail.com>.

I think there might be memory efficiency issues that should pursuade us to
adopt charsets rather than the current approach. I believe (rtfs to be
sure) that charset.decode doesn't deep copy the underlying byte buffer
which is presumably good from a gc standpoint.

Either way, UTF8 is certainly the most widely used charset in existing
deployments and changing the default to something that is non-backwards
compatible is probably a bad idea. I'm not familiar with the
characteristics of the alternatives, but I strongly believe any across the
board change needs to be compatible with existing deployments.

Perhaps a better approach than a JVM option or forcing one standard would
be to create a configuration option.
On Oct 29, 2012 9:22 PM, "Drew Farris" <dr...@apache.org> wrote:

> I have always wondered if there were cases in the API where users are
> forced to use Text when they would otherwise prefer byte[], e.g: stuffing a
> non utf8 byte[] into a Text object to facilitate storage or sorting. Not
> entirely sure whether Text would complain if this were the case. I suspect
> we should seek to elimimate these if they currently exist.
>
> Speaking strictly of user data, I agree that fundamentally, every operation
> should be based upon byte[]. API methods providing Text and String based
> calls should be convience methods where the conversion of text to/from
> bytes is handled explicitly (not relying on platform default encoding or
> properties) and transparently (doing something sensible when the user
> doesn't care or is unaware of the issues surrounding character encoding).
>
> Regarding utf8, is there a need to support arbitrary character encodings
> when persisting bytes to accumulo? Think byte order for lexical sorting,
> fixed vs variable length, etc. Perhaps it would not be unreasonable to
> support explicitly stating a character encoding on table creation?
>
> Drew
>  On Oct 29, 2012 8:47 PM, "Josh Elser" <jo...@gmail.com> wrote:
>
> > +1 Mike.
> >
> > 1. It would be hard for me to believe Key/Value are ever handled
> > internally in terms of Strings, but, if such a case does exist, it would
> be
> > extremely prudent to fix.
> >
> > 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced
> > by other commands [1,2]. It would be good to double check all of the
> other
> > commands.
> >
> > [1] https://github.com/apache/**accumulo/blob/trunk/core/src/**
> > main/java/org/apache/accumulo/**core/util/shell/Shell.java<
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> >
> > [2] https://github.com/apache/**accumulo/blob/trunk/core/src/**
> > main/java/org/apache/accumulo/**core/util/shell/commands/**
> > InsertCommand.java<
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
> >
> >
> > On 10/29/2012 8:27 PM, Michael Flester wrote:
> >
> >> I agree with Benson entirely with one caveat. It seems to me that there
> >> might be two categories of things being discussed
> >>
> >>    1. User data (keys and values)
> >>    2. Ancillary things needed for operation of Accumulo (passwords).
> >>
> >> These could well be considered separately. Trying to do anything with
> >> keys and values other than treating them as bytes all of the time
> >> I find quite scary.
> >>
> >> And if this is only being done to satisfy pmd or findbugs, those tools
> >> can be convinced to modify their reporting about this issue.
> >>
> >>
>

Re: Setting Charset in getBytes() call.

Posted by Benson Margulies <bi...@gmail.com>.

UTF-8 allows any String to have to go into the database and get back
out. Are there actually use cases in which some application code
pushes in strings, and another pulls out bytes, and would be perturbed
to find UTF-8 as opposed to some other encoding in the bytes?

Re: Setting Charset in getBytes() call.

Posted by Marc Parisi <ma...@accumulo.net>.

sorry, I meant *for that instance

On Tue, Oct 30, 2012 at 6:28 PM, Marc Parisi <ma...@accumulo.net> wrote:

> Instead of taking the platform encoding ( which can be changed as von
> cloud suggested ), we're creating a class specific private definition? This
> isn't particularly dynamic and an override for those who want to use a
> platform specific encoding. This is a maintenance nightmare. I think before
> you move forward with it, think of someone who doesn't want this change and
> how difficult it would be to utilize their own or a different encoding
> scheme.
>
> At the very least you can create a singleton that creates a dynamic
> encoding for that, which can be modified via the accumulo configuration.
>
>
> On Tue, Oct 30, 2012 at 4:02 PM, Eric Newton <er...@gmail.com>wrote:
>
>> > The current reliance on Text is in my opinion the greatest deficit of
>> the
>> > client API-
>>
>>
>>
>> Heh... if only that were true!
>>
>> But it is a big wart: requires users' to include the hadoop-core library
>> in
>> client code, which is unfortunate.  But it's probably going to be required
>> for the importDirectory() call anyhow.
>>
>> -Eric
>>
>
>

Re: Setting Charset in getBytes() call.

Posted by Marc Parisi <ma...@accumulo.net>.

Instead of taking the platform encoding ( which can be changed as von cloud
suggested ), we're creating a class specific private definition? This isn't
particularly dynamic and an override for those who want to use a platform
specific encoding. This is a maintenance nightmare. I think before you move
forward with it, think of someone who doesn't want this change and how
difficult it would be to utilize their own or a different encoding scheme.

At the very least you can create a singleton that creates a dynamic
encoding for that, which can be modified via the accumulo configuration.

On Tue, Oct 30, 2012 at 4:02 PM, Eric Newton <er...@gmail.com> wrote:

> > The current reliance on Text is in my opinion the greatest deficit of the
> > client API-
>
>
>
> Heh... if only that were true!
>
> But it is a big wart: requires users' to include the hadoop-core library in
> client code, which is unfortunate.  But it's probably going to be required
> for the importDirectory() call anyhow.
>
> -Eric
>

Re: Setting Charset in getBytes() call.

Posted by Eric Newton <er...@gmail.com>.

> The current reliance on Text is in my opinion the greatest deficit of the
> client API-



Heh... if only that were true!

But it is a big wart: requires users' to include the hadoop-core library in
client code, which is unfortunate.  But it's probably going to be required
for the importDirectory() call anyhow.

-Eric

Re: Setting Charset in getBytes() call.

Posted by Ed Kohlwey <ek...@gmail.com>.

Also, on the topic of byte arrays - we should do one better than hbase and
go for ByteBuffers. They are more reusable and long-lived buffers can be
allocated outside the heap and take advantage of OS I/O optimizations.

The current reliance on Text is in my opinion the greatest deficit of the
client API- I have been fiddling with creating a new API, similar to the
work Keith did with typo, but instead looking at introducing generic
superclasses to reduce the API profile.
On Oct 29, 2012 9:22 PM, "Drew Farris" <dr...@apache.org> wrote:

> I have always wondered if there were cases in the API where users are
> forced to use Text when they would otherwise prefer byte[], e.g: stuffing a
> non utf8 byte[] into a Text object to facilitate storage or sorting. Not
> entirely sure whether Text would complain if this were the case. I suspect
> we should seek to elimimate these if they currently exist.
>
> Speaking strictly of user data, I agree that fundamentally, every operation
> should be based upon byte[]. API methods providing Text and String based
> calls should be convience methods where the conversion of text to/from
> bytes is handled explicitly (not relying on platform default encoding or
> properties) and transparently (doing something sensible when the user
> doesn't care or is unaware of the issues surrounding character encoding).
>
> Regarding utf8, is there a need to support arbitrary character encodings
> when persisting bytes to accumulo? Think byte order for lexical sorting,
> fixed vs variable length, etc. Perhaps it would not be unreasonable to
> support explicitly stating a character encoding on table creation?
>
> Drew
>  On Oct 29, 2012 8:47 PM, "Josh Elser" <jo...@gmail.com> wrote:
>
> > +1 Mike.
> >
> > 1. It would be hard for me to believe Key/Value are ever handled
> > internally in terms of Strings, but, if such a case does exist, it would
> be
> > extremely prudent to fix.
> >
> > 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced
> > by other commands [1,2]. It would be good to double check all of the
> other
> > commands.
> >
> > [1] https://github.com/apache/**accumulo/blob/trunk/core/src/**
> > main/java/org/apache/accumulo/**core/util/shell/Shell.java<
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> >
> > [2] https://github.com/apache/**accumulo/blob/trunk/core/src/**
> > main/java/org/apache/accumulo/**core/util/shell/commands/**
> > InsertCommand.java<
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
> >
> >
> > On 10/29/2012 8:27 PM, Michael Flester wrote:
> >
> >> I agree with Benson entirely with one caveat. It seems to me that there
> >> might be two categories of things being discussed
> >>
> >>    1. User data (keys and values)
> >>    2. Ancillary things needed for operation of Accumulo (passwords).
> >>
> >> These could well be considered separately. Trying to do anything with
> >> keys and values other than treating them as bytes all of the time
> >> I find quite scary.
> >>
> >> And if this is only being done to satisfy pmd or findbugs, those tools
> >> can be convinced to modify their reporting about this issue.
> >>
> >>
>

Re: Setting Charset in getBytes() call.

Posted by Adam Fuchs <af...@apache.org>.

On Mon, Oct 29, 2012 at 9:22 PM, Drew Farris <dr...@apache.org> wrote:

> I have always wondered if there were cases in the API where users are
> forced to use Text when they would otherwise prefer byte[], e.g: stuffing a
> non utf8 byte[] into a Text object to facilitate storage or sorting. Not
> entirely sure whether Text would complain if this were the case. I suspect
> we should seek to elimimate these if they currently exist.
>

The Text class is essentially a wrapper around a byte[], with some
convenience methods for translating to/from other types. Accumulo only ever
reads bytes out of it, so there is no encoding problem there. We also don't
use most of its convenience methods. Many people see that it is named
"Text" and assume that it only stores human readable text, but that is not
the case. It probably should have been named
"ConvenientByteArrayWrapperWithSomeMemoryEfficiencySupportAndStringOrientedTranslationMethodsThatIsWritableComparable".

I also agree that it would be good to get rid of the reliance on Hadoop's
Text object, especially because people often do not respect getLength() on
the byte[] obtained from getBytes().

Adam

Re: Setting Charset in getBytes() call.

Posted by Drew Farris <dr...@apache.org>.

I have always wondered if there were cases in the API where users are
forced to use Text when they would otherwise prefer byte[], e.g: stuffing a
non utf8 byte[] into a Text object to facilitate storage or sorting. Not
entirely sure whether Text would complain if this were the case. I suspect
we should seek to elimimate these if they currently exist.

Speaking strictly of user data, I agree that fundamentally, every operation
should be based upon byte[]. API methods providing Text and String based
calls should be convience methods where the conversion of text to/from
bytes is handled explicitly (not relying on platform default encoding or
properties) and transparently (doing something sensible when the user
doesn't care or is unaware of the issues surrounding character encoding).

Regarding utf8, is there a need to support arbitrary character encodings
when persisting bytes to accumulo? Think byte order for lexical sorting,
fixed vs variable length, etc. Perhaps it would not be unreasonable to
support explicitly stating a character encoding on table creation?

Drew
 On Oct 29, 2012 8:47 PM, "Josh Elser" <jo...@gmail.com> wrote:

> +1 Mike.
>
> 1. It would be hard for me to believe Key/Value are ever handled
> internally in terms of Strings, but, if such a case does exist, it would be
> extremely prudent to fix.
>
> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced
> by other commands [1,2]. It would be good to double check all of the other
> commands.
>
> [1] https://github.com/apache/**accumulo/blob/trunk/core/src/**
> main/java/org/apache/accumulo/**core/util/shell/Shell.java<https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java>
> [2] https://github.com/apache/**accumulo/blob/trunk/core/src/**
> main/java/org/apache/accumulo/**core/util/shell/commands/**
> InsertCommand.java<https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java>
>
> On 10/29/2012 8:27 PM, Michael Flester wrote:
>
>> I agree with Benson entirely with one caveat. It seems to me that there
>> might be two categories of things being discussed
>>
>>    1. User data (keys and values)
>>    2. Ancillary things needed for operation of Accumulo (passwords).
>>
>> These could well be considered separately. Trying to do anything with
>> keys and values other than treating them as bytes all of the time
>> I find quite scary.
>>
>> And if this is only being done to satisfy pmd or findbugs, those tools
>> can be convinced to modify their reporting about this issue.
>>
>>

Re: Setting Charset in getBytes() call.

Posted by Josh Elser <jo...@gmail.com>.

I also worked through the changes and found some questionable changes.

https://issues.apache.org/jira/browse/ACCUMULO-836?focusedCommentId=13489228&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13489228

On 10/31/2012 01:02 PM, Christopher Tubbs wrote:
> I've added my own comments to this thread on the ACCUMULO-840 ticket.
> https://issues.apache.org/jira/browse/ACCUMULO-840?focusedCommentId=13488024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13488024
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Tue, Oct 30, 2012 at 10:35 PM, John Vines<vi...@apache.org>  wrote:
>> Why not just have a configuration in the xml file for setting a global
>> charset? This way we avoid hard coded settings but also avoid the issue of
>> shared vm issues.
>>
>> John
>>
>> Sent from my phone, pardon the typos and brevity.
>> On Oct 30, 2012 10:29 PM, "David Medinets"<da...@gmail.com>  wrote:
>>
>>> Re-reading and re-thinking I can see your point about how, by
>>> specifying UTF-8, Accumulo is now flouting the file.encoding
>>> parameter. I'd like to implement a static method inside
>>> core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then
>>> do something like getBytes(Encoding.getCharset()) instead of
>>> hard-coding UTF-8.
>>>
>>> Class Encoding {
>>>    private static final Charset charset = null;
>>>    public Charset getCharset() {
>>>      if (charset == null) {
>>>        charset = Charset.forName(System.getProperty("file.encoding",
>>> "UTF-8"));
>>>      }
>>>      return charset;
>>>    }
>>>    ...
>>> }
>>>

Re: Setting Charset in getBytes() call.

Posted by David Medinets <da...@gmail.com>.

Christopher, thanks for taking the time to writeup your comments. Nicely done.

Re: Setting Charset in getBytes() call.

Posted by Christopher Tubbs <ct...@gmail.com>.

On Fri, Nov 2, 2012 at 3:56 PM, Benson Margulies <bi...@gmail.com> wrote:
> Maybe I'm being particularly dense, but I still think that this is
> being made too complex by failing to enumerate the specific goals.

I agree that there has been a failure to enumerate specific goals with
regard to encoding. I made an attempt to identify potential goals
(scopes), for which encoding matters on this ticket:
https://issues.apache.org/jira/browse/ACCUMULO-840

In there, I identify two considerations:
1) API issues addressing consistency for user data (eg. passwords,
table names, Mutation constructors that take Strings), and
2) INTERNAL issues related to Accumulo storing and reading state that
persists or is communicated between its operating components (a clear
example of this is how we store the !METADATA column family names,
which start out as Java String literals, and get encoded to bytes by
the time it gets stored in the table).

I think #1 can be addressed by simply waiting until somebody presents
a feature request with a use case, and in the meantime, we simply
don't touch it.

I think #2 can be addressed by establishing an internal policy (along
the lines of our codestyle standards) that establishes that Accumulo
will consistently store String data for its internal use as UTF8 when
we have to store that String as bytes, and when we convert such bytes
into Strings, we do so under the assumption it is UTF8. If we can
agree to this policy, anything that is actually non-compliant (i.e.
where there's a possibility it won't be stored or read as UTF8) will
simply be a bug that we apply a very narrowly scoped bugfix to ensure
consistency with the policy. I think David has already identified some
such cases and attempted to fix them in the process of working on
ACCUMULO-836. I think those are fine, but they need to be checked to
ensure that when they are converted back to a String, they are read as
UTF8. However, it might be better if these changes were split into
separate bugs, because even though they are all the same class of bug,
they apply to separate components (eg. "Potential bug - Inconsistent
encoding with Zookeeper data", "Potential bug - Inconsistent encoding
with mapreduce configuration", etc.). These bugs can be identified and
fixed as we encounter them, rather than as an attempt to fix the
entire code base. We shouldn't have to spend a lot of time on them...
we should do the simple thing first: establish the policy.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

Re: Setting Charset in getBytes() call.

Posted by John Vines <vi...@apache.org>.

Client/server mismatch is a giant problem. And the more combustibility we
put into Accumulo the closer we get to users hitting a knowledge barrier
about knowing the specifics of their Accumulo instance. i believe there are
two avenues for dealing with this-
1. Avoid at all costs. Unfortunately, this can ultimately boil down to
users losing features because we don't want them to have any sort of
intimate knowledge of the system.
2. A remote configuration utility. If we can have the client code pull the
configuration from the server, perhaps when Connection is made, we can have
our client APIs consistent on both sides of the channel. I believe a
solution like this could handle the issue Benson mentions, but it also
means we cannot approach this encoding issue with file.encoding.

Personally, I think the second option is an inevitability for us as we do
more and more features which are configuration specific. Either way, it
does seem that file.encoding is not sufficient as we want to avoid the
client code requiring some extremely specific documentation. it might even
be an incompatible configuration with what the client wants to do.

I think we are overgeneralizing this issue though. Josh did a decent job
and starting to hammer away on this. It's not just a matter of us doing
weird things with encodings, but the cases for them. For instance, all
zookeeper operations need to be done the same way across the board. This is
needs to be shared knowledge for both servers and clients. So these should
have charset specifications. But other things (pulling things out of thin
air), such as the client api methods, are outside of the purview. Primarily
because they are not associated with any tables until well after they are
created. So that is a user-space burden and should not be a concern with
us. Or any sort of local string operation.

It boils down to if it directly goes into HDFS, zookeeper, or the !METADATA
table then we should enforce encoding, in the way Dave approached it.
Outside of those scopes I think we should really just leave them the hell
alone because the system shouldn't be messing with user's data.

John

On Fri, Nov 2, 2012 at 3:56 PM, Benson Margulies <bi...@gmail.com>wrote:

> Maybe I'm being particularly dense, but I still think that this is
> being made too complex by failing to enumerate the specific goals.
>
> First case; data for which Accumulo is defined to persistently store
> *characters*, as opposed to bytes. I would hope that, in all such
> cases, we would agree that those characters should be stored in some
> Unicode format, never in some legacy encoding.
>
> Second case; data for which Accumulo is defined to store bytes, but,
> for convenience, an API allows the user to read and write characters.
> In this case, I can imagine two competing API designs. One would be to
> mirror Java, and in all such cases give the user the option of
> specifying the charset, defaulting to file.encoding. The other would
> be to insist on UTF-8. A third possibility - to just respect
> file.encoding - seems to me to be retreading the errors of Java 1.x.
>
> Third case; cases in which the user either supplies a text file for
> Accumulo to read, or asks Accumulo to write a text file. Having an API
> that can default to file.encoding here would be convenient for users,
> who want files in their platform's default encoding. Note that this is
> incompatible with the notion of *setting* file.encoding as an
> implementation technique for getting the string constructor and
> getBytes() to do UTF-8.
>
> Finally for today, I had a hard time following the response to my
> writing on servlets. I'll vastly simplify my presentation: when a user
> of Accumulo writes Java code that calls the Accumulo API, I find it
> unacceptable to require that user to set file.encoding to get correct
> behavior from Accumulo, except as described in the second case above.
> When Accumulo classes are integrated into user applications, Accumulo
> must respect file.encoding, or ignore file.encoding, but it cannot
> require the user to set it to something in particular to get correct
> behavior.
>

Re: Setting Charset in getBytes() call.

Posted by Benson Margulies <bi...@gmail.com>.

Maybe I'm being particularly dense, but I still think that this is
being made too complex by failing to enumerate the specific goals.

First case; data for which Accumulo is defined to persistently store
*characters*, as opposed to bytes. I would hope that, in all such
cases, we would agree that those characters should be stored in some
Unicode format, never in some legacy encoding.

Second case; data for which Accumulo is defined to store bytes, but,
for convenience, an API allows the user to read and write characters.
In this case, I can imagine two competing API designs. One would be to
mirror Java, and in all such cases give the user the option of
specifying the charset, defaulting to file.encoding. The other would
be to insist on UTF-8. A third possibility - to just respect
file.encoding - seems to me to be retreading the errors of Java 1.x.

Third case; cases in which the user either supplies a text file for
Accumulo to read, or asks Accumulo to write a text file. Having an API
that can default to file.encoding here would be convenient for users,
who want files in their platform's default encoding. Note that this is
incompatible with the notion of *setting* file.encoding as an
implementation technique for getting the string constructor and
getBytes() to do UTF-8.

Finally for today, I had a hard time following the response to my
writing on servlets. I'll vastly simplify my presentation: when a user
of Accumulo writes Java code that calls the Accumulo API, I find it
unacceptable to require that user to set file.encoding to get correct
behavior from Accumulo, except as described in the second case above.
When Accumulo classes are integrated into user applications, Accumulo
must respect file.encoding, or ignore file.encoding, but it cannot
require the user to set it to something in particular to get correct
behavior.

Re: Setting Charset in getBytes() call.

Posted by Marc Parisi <ma...@accumulo.net>.

John, that would lead us to a configuration management issue. To keep
configuration files in line would be the same as ensuring file.encoding is
the same across the platform.

The JLS doesn't specify a Charset encoding scheme; however, for quite some
time the file.encoding fall through ( that is, when it's not specified ) is
UTF-8. This could change, and is not backed by the JLS, yet, file.encoding
is. It's a fallthrough, meant to take care of configuration mismanagement.

Further, these changes will have issues if you specify a file.encoding in
your configuration, as you don't always enforce UTF-8 in every String
instance, especially in some of your aggregator changes.

"If Accumulo was only a pile of servers, you could do this. You could
say that part of the configuration process for the servers is to
specify the desired encoding to file.encoding, and your shell scripts
could set UTF-8 by default.

But Accumulo is *not* just a pile of servers. Setting file.encoding
effects the entire JVM. A webapp that uses Accumulo now would need to
have the entire servlet container have a particular setting of
file.encoding. This just does not work in the wild. Even without the
servlet container issue, a user of Accumulo may need to plug it into
an existing code base that has other reasons to set file.encoding, and
will not like it when Accumulo starts to corrupt his or her string
data."

I gather that what you mean is that multiple, transient, execution paths
within the tserver should support multiple encodings; however, setting
file.encoding ensures that the platform, which is encompassed in a JVM,
encodes and decodes values in an understood way ( that's what character set
encodings are meant to enforce ). If a user wishes to have his or her own
execution path ( or their own encoding for an iterator ), then he/she would
likely define this. The fact that we require configuration parameters for
the bulk of these changes in core is an indication that the core API
contains features that are seeping into user functionality. Keep the
encoding/decoding at client code, not within the tserver process. Use
file.encoding for the core project, and our changeset would be much
smaller, require that clients do their own encoding/decoding.

A webapp is a fantastic example; however, let's take it a step further.
Accumulo is JBoss. The iterator/client code is the webapp. We should
separate Accumulo from client and client iterator code to avoid these
design issues and place the onus on the user, not accumulo. In all honesty,
and I'm probably off base, but in the case of iterators, we should move
them to a different package, and if so desired, add options to the
iterators, but there is no need to default to UTF-8. It's been that way for
some time.

On Wed, Oct 31, 2012 at 2:02 PM, Christopher Tubbs <ct...@gmail.com>wrote:

> I've added my own comments to this thread on the ACCUMULO-840 ticket.
>
> https://issues.apache.org/jira/browse/ACCUMULO-840?focusedCommentId=13488024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13488024
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Tue, Oct 30, 2012 at 10:35 PM, John Vines <vi...@apache.org> wrote:
> > Why not just have a configuration in the xml file for setting a global
> > charset? This way we avoid hard coded settings but also avoid the issue
> of
> > shared vm issues.
> >
> > John
> >
> > Sent from my phone, pardon the typos and brevity.
> > On Oct 30, 2012 10:29 PM, "David Medinets" <da...@gmail.com>
> wrote:
> >
> >> Re-reading and re-thinking I can see your point about how, by
> >> specifying UTF-8, Accumulo is now flouting the file.encoding
> >> parameter. I'd like to implement a static method inside
> >> core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then
> >> do something like getBytes(Encoding.getCharset()) instead of
> >> hard-coding UTF-8.
> >>
> >> Class Encoding {
> >>   private static final Charset charset = null;
> >>   public Charset getCharset() {
> >>     if (charset == null) {
> >>       charset = Charset.forName(System.getProperty("file.encoding",
> >> "UTF-8"));
> >>     }
> >>     return charset;
> >>   }
> >>   ...
> >> }
> >>
>

Re: Setting Charset in getBytes() call.

Posted by Christopher Tubbs <ct...@gmail.com>.

I've added my own comments to this thread on the ACCUMULO-840 ticket.
https://issues.apache.org/jira/browse/ACCUMULO-840?focusedCommentId=13488024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13488024

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Tue, Oct 30, 2012 at 10:35 PM, John Vines <vi...@apache.org> wrote:
> Why not just have a configuration in the xml file for setting a global
> charset? This way we avoid hard coded settings but also avoid the issue of
> shared vm issues.
>
> John
>
> Sent from my phone, pardon the typos and brevity.
> On Oct 30, 2012 10:29 PM, "David Medinets" <da...@gmail.com> wrote:
>
>> Re-reading and re-thinking I can see your point about how, by
>> specifying UTF-8, Accumulo is now flouting the file.encoding
>> parameter. I'd like to implement a static method inside
>> core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then
>> do something like getBytes(Encoding.getCharset()) instead of
>> hard-coding UTF-8.
>>
>> Class Encoding {
>>   private static final Charset charset = null;
>>   public Charset getCharset() {
>>     if (charset == null) {
>>       charset = Charset.forName(System.getProperty("file.encoding",
>> "UTF-8"));
>>     }
>>     return charset;
>>   }
>>   ...
>> }
>>

Re: Setting Charset in getBytes() call.

Posted by John Vines <vi...@apache.org>.

Why not just have a configuration in the xml file for setting a global
charset? This way we avoid hard coded settings but also avoid the issue of
shared vm issues.

John

Sent from my phone, pardon the typos and brevity.
On Oct 30, 2012 10:29 PM, "David Medinets" <da...@gmail.com> wrote:

> Re-reading and re-thinking I can see your point about how, by
> specifying UTF-8, Accumulo is now flouting the file.encoding
> parameter. I'd like to implement a static method inside
> core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then
> do something like getBytes(Encoding.getCharset()) instead of
> hard-coding UTF-8.
>
> Class Encoding {
>   private static final Charset charset = null;
>   public Charset getCharset() {
>     if (charset == null) {
>       charset = Charset.forName(System.getProperty("file.encoding",
> "UTF-8"));
>     }
>     return charset;
>   }
>   ...
> }
>

Re: Setting Charset in getBytes() call.

Posted by David Medinets <da...@gmail.com>.

Re-reading and re-thinking I can see your point about how, by
specifying UTF-8, Accumulo is now flouting the file.encoding
parameter. I'd like to implement a static method inside
core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then
do something like getBytes(Encoding.getCharset()) instead of
hard-coding UTF-8.

Class Encoding {
  private static final Charset charset = null;
  public Charset getCharset() {
    if (charset == null) {
      charset = Charset.forName(System.getProperty("file.encoding", "UTF-8"));
    }
    return charset;
  }
  ...
}

Re: Setting Charset in getBytes() call.

Posted by William Slacum <wi...@accumulo.net>.

Accumulo may not be just a set of servers, but it is designed to be a set
of processes, which means having their own JVM. I think this mostly boils
down to an issue of API however-- if Accumulo deals with user's data in
terms of bytes, then this issue is put back on the user, which I'm fine
with as a trade off between configuration versus convention.

There are other cases beyond simply a client API, though, namely
configuration. I'm more comfortable with enforcing some standard there.

On Tue, Oct 30, 2012 at 8:31 PM, Benson Margulies <bi...@gmail.com>wrote:

> On Tue, Oct 30, 2012 at 8:21 PM, Josh Elser <jo...@gmail.com> wrote:
> > On 10/30/2012 7:47 PM, David Medinets wrote:
> >>>
> >>> My issue with this is that you have now hard-coded the fact that
> everyone
> >>> else is going to use UTF-8.
> >>
> >>
> >> Who is everyone else? I agree that I have hard-coded the use of UTF-8.
> >> On the other hand, I've merely codified an existing practice. Thus the
> >> issue is now exposed, the places the convention is used are defined.
> >> Once a consensus is reached, we can implement it with confidence.
> >
> >
> > "Everyone else" is everyone who builds Accumulo since you committed your
> > changes and uses it. Ignoring that, forcing a single charset isn't the
> big
> > issue here (as we've *all* agreed that UTF-8 should not cause any
> > data-correctness issues) so for now I'll just drop it as it's just
> creating
> > confusion.
> >
> > My issue is *how* you implemented the default charset. We already have 3
> > people (Marc, Bill and myself) who have stated that we believe inline
> > charset declaration is not the correct implementation and that using the
> JVM
> > property is the better implementation.
> >
> > I'd encourage others to weigh in to form a complete consensus and shift
> the
> > discussion to that implementation if needed.
> >
> >>
> >>> way to fix the problem. I still contest that setting the desired
> encoding
> >>> (via the appropriate JVM property like Bill Slacum initial suggested)
> is
> >>> the
> >>> proper way to address the issue.
> >>
> >>
> >> It is easy to do both. Create a ByteEncodingInitializer (or somesuch)
> >> class that reads the JVM property and defines a globally used Charset.
> >> The find those utf8 definitions and usages and replace them with the
> >> globally-defined value.
> >
> >
> > Again, by setting the 'file.encoding' JVM parameter, such a class is
> > unnecessary because it should be handled internal to Java. For Oracle/Sun
> > JDK and OpenJDK, setting the "file.encoding" parameter at run time will
> use
> > the provided charset you wanted without actually changing any code.
>
> If Accumulo was only a pile of servers, you could do this. You could
> say that part of the configuration process for the servers is to
> specify the desired encoding to file.encoding, and your shell scripts
> could set UTF-8 by default.
>
> But Accumulo is *not* just a pile of servers. Setting file.encoding
> effects the entire JVM. A webapp that uses Accumulo now would need to
> have the entire servlet container have a particular setting of
> file.encoding. This just does not work in the wild. Even without the
> servlet container issue, a user of Accumulo may need to plug it into
> an existing code base that has other reasons to set file.encoding, and
> will not like it when Accumulo starts to corrupt his or her string
> data.
>

Re: Setting Charset in getBytes() call.

Posted by Benson Margulies <bi...@gmail.com>.

On Tue, Oct 30, 2012 at 8:21 PM, Josh Elser <jo...@gmail.com> wrote:
> On 10/30/2012 7:47 PM, David Medinets wrote:
>>>
>>> My issue with this is that you have now hard-coded the fact that everyone
>>> else is going to use UTF-8.
>>
>>
>> Who is everyone else? I agree that I have hard-coded the use of UTF-8.
>> On the other hand, I've merely codified an existing practice. Thus the
>> issue is now exposed, the places the convention is used are defined.
>> Once a consensus is reached, we can implement it with confidence.
>
>
> "Everyone else" is everyone who builds Accumulo since you committed your
> changes and uses it. Ignoring that, forcing a single charset isn't the big
> issue here (as we've *all* agreed that UTF-8 should not cause any
> data-correctness issues) so for now I'll just drop it as it's just creating
> confusion.
>
> My issue is *how* you implemented the default charset. We already have 3
> people (Marc, Bill and myself) who have stated that we believe inline
> charset declaration is not the correct implementation and that using the JVM
> property is the better implementation.
>
> I'd encourage others to weigh in to form a complete consensus and shift the
> discussion to that implementation if needed.
>
>>
>>> way to fix the problem. I still contest that setting the desired encoding
>>> (via the appropriate JVM property like Bill Slacum initial suggested) is
>>> the
>>> proper way to address the issue.
>>
>>
>> It is easy to do both. Create a ByteEncodingInitializer (or somesuch)
>> class that reads the JVM property and defines a globally used Charset.
>> The find those utf8 definitions and usages and replace them with the
>> globally-defined value.
>
>
> Again, by setting the 'file.encoding' JVM parameter, such a class is
> unnecessary because it should be handled internal to Java. For Oracle/Sun
> JDK and OpenJDK, setting the "file.encoding" parameter at run time will use
> the provided charset you wanted without actually changing any code.

If Accumulo was only a pile of servers, you could do this. You could
say that part of the configuration process for the servers is to
specify the desired encoding to file.encoding, and your shell scripts
could set UTF-8 by default.

But Accumulo is *not* just a pile of servers. Setting file.encoding
effects the entire JVM. A webapp that uses Accumulo now would need to
have the entire servlet container have a particular setting of
file.encoding. This just does not work in the wild. Even without the
servlet container issue, a user of Accumulo may need to plug it into
an existing code base that has other reasons to set file.encoding, and
will not like it when Accumulo starts to corrupt his or her string
data.

Re: Setting Charset in getBytes() call.

Posted by Josh Elser <jo...@gmail.com>.

On 10/30/2012 7:47 PM, David Medinets wrote:
>> My issue with this is that you have now hard-coded the fact that everyone else is going to use UTF-8.
>
> Who is everyone else? I agree that I have hard-coded the use of UTF-8.
> On the other hand, I've merely codified an existing practice. Thus the
> issue is now exposed, the places the convention is used are defined.
> Once a consensus is reached, we can implement it with confidence.

"Everyone else" is everyone who builds Accumulo since you committed your 
changes and uses it. Ignoring that, forcing a single charset isn't the 
big issue here (as we've *all* agreed that UTF-8 should not cause any 
data-correctness issues) so for now I'll just drop it as it's just 
creating confusion.

My issue is *how* you implemented the default charset. We already have 3 
people (Marc, Bill and myself) who have stated that we believe inline 
charset declaration is not the correct implementation and that using the 
JVM property is the better implementation.

I'd encourage others to weigh in to form a complete consensus and shift 
the discussion to that implementation if needed.

>
>> way to fix the problem. I still contest that setting the desired encoding
>> (via the appropriate JVM property like Bill Slacum initial suggested) is the
>> proper way to address the issue.
>
> It is easy to do both. Create a ByteEncodingInitializer (or somesuch)
> class that reads the JVM property and defines a globally used Charset.
> The find those utf8 definitions and usages and replace them with the
> globally-defined value.

Again, by setting the 'file.encoding' JVM parameter, such a class is 
unnecessary because it should be handled internal to Java. For 
Oracle/Sun JDK and OpenJDK, setting the "file.encoding" parameter at run 
time will use the provided charset you wanted without actually changing 
any code.

Re: Setting Charset in getBytes() call.

Posted by David Medinets <da...@gmail.com>.

> My issue with this is that you have now hard-coded the fact that everyone else is going to use UTF-8.

Who is everyone else? I agree that I have hard-coded the use of UTF-8.
On the other hand, I've merely codified an existing practice. Thus the
issue is now exposed, the places the convention is used are defined.
Once a consensus is reached, we can implement it with confidence.

> way to fix the problem. I still contest that setting the desired encoding
> (via the appropriate JVM property like Bill Slacum initial suggested) is the
> proper way to address the issue.

It is easy to do both. Create a ByteEncodingInitializer (or somesuch)
class that reads the JVM property and defines a globally used Charset.
The find those utf8 definitions and usages and replace them with the
globally-defined value.

Re: Setting Charset in getBytes() call.

Posted by Josh Elser <jo...@gmail.com>.

On 10/29/2012 10:47 PM, David Medinets wrote:
>
> The code compiles and the tests run. I don't see any reason why I
> should not commit my work to the trunk for v1.5.0. I don't want to
> cause disharmony but I can't see the harm. And even if my change
> causes some problem, wouldn't it be better to know that while v1.5.0
> is still being actively developed?
>

My issue with this is that you have now hard-coded the fact that 
everyone else is going to use UTF-8. Yes, it most likely won't affect 
any compilation or tests (or even 90% of users), but I do not agree that 
this is the best way to fix the problem. I still contest that setting 
the desired encoding (via the appropriate JVM property like Bill Slacum 
initial suggested) is the proper way to address the issue.

I don't feel like we ever actually came to a consensus on this discussion.

Re: Setting Charset in getBytes() call.

Posted by David Medinets <da...@gmail.com>.

I've looked at every getBytes() call. I have changed 82 files to use
getBytes(utf8). Each files uses the following Charset declaration.

    private static final Charset utf8 = Charset.forName("UTF8");

If at some future time the Charset should be changed or another
approach is decided upon, simply search for that string and
refactoring will be straightforward.

The attached file shows how I performed the search for getBytes() and
has a list of files that continue to have getBytes() because it is
called on a Text or some other kind of object. The code just prints a
list of files using getBytes(). Then I manually reviewed the files,
made changes or added the file name to the ignore list.

The code compiles and the tests run. I don't see any reason why I
should not commit my work to the trunk for v1.5.0. I don't want to
cause disharmony but I can't see the harm. And even if my change
causes some problem, wouldn't it be better to know that while v1.5.0
is still being actively developed?

Re: Setting Charset in getBytes() call.

Posted by John Vines <vi...@apache.org>.

We also need to be concerned about any string convenience classes using an
encoding scheme that still has some logical sorting (if that's an issue).

Sent from my phone, pardon the typos and brevity.
On Oct 29, 2012 9:57 PM, "Josh Elser" <jo...@gmail.com> wrote:

> I'm saying that I don't know of anything in the core API which performs a
> getBytes() on the data itself. Accumulo itself is agnostic dealing only in
> byte[]. I think we're saying the same thing..
>
> On 10/29/2012 8:54 PM, Benson Margulies wrote:
>
>> On Mon, Oct 29, 2012 at 8:46 PM, Josh Elser <jo...@gmail.com> wrote:
>>
>>> +1 Mike.
>>>
>>> 1. It would be hard for me to believe Key/Value are ever handled
>>> internally
>>> in terms of Strings, but, if such a case does exist, it would be
>>> extremely
>>> prudent to fix.
>>>
>>> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is
>>> referenced by
>>> other commands [1,2]. It would be good to double check all of the other
>>> commands.
>>>
>>
>> I'm a bit lost. Any possible Java String can be rendered in UTF-8. So,
>> if you are calling String.getBytes to turn a string into some bytes
>> for some purpose, I think you need UTF-8.
>>
>> On the other hand, as Mike pointed out, new String(somebytes, "utf-8")
>> will destroy data for some byte values that are not, in fact, UTF-8.
>> By why would Accumulo ever need to string-ify some array of bytes of
>> uncertain parentage?
>>
>>
>>
>>> [1]
>>> https://github.com/apache/**accumulo/blob/trunk/core/src/**
>>> main/java/org/apache/accumulo/**core/util/shell/Shell.java<https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java>
>>> [2]
>>> https://github.com/apache/**accumulo/blob/trunk/core/src/**
>>> main/java/org/apache/accumulo/**core/util/shell/commands/**
>>> InsertCommand.java<https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java>
>>>
>>>
>>> On 10/29/2012 8:27 PM, Michael Flester wrote:
>>>
>>>>
>>>> I agree with Benson entirely with one caveat. It seems to me that there
>>>> might be two categories of things being discussed
>>>>
>>>>     1. User data (keys and values)
>>>>     2. Ancillary things needed for operation of Accumulo (passwords).
>>>>
>>>> These could well be considered separately. Trying to do anything with
>>>> keys and values other than treating them as bytes all of the time
>>>> I find quite scary.
>>>>
>>>> And if this is only being done to satisfy pmd or findbugs, those tools
>>>> can be convinced to modify their reporting about this issue.
>>>>
>>>>
>>>

Re: Setting Charset in getBytes() call.

Posted by Josh Elser <jo...@gmail.com>.

I'm saying that I don't know of anything in the core API which performs 
a getBytes() on the data itself. Accumulo itself is agnostic dealing 
only in byte[]. I think we're saying the same thing..

On 10/29/2012 8:54 PM, Benson Margulies wrote:
> On Mon, Oct 29, 2012 at 8:46 PM, Josh Elser <jo...@gmail.com> wrote:
>> +1 Mike.
>>
>> 1. It would be hard for me to believe Key/Value are ever handled internally
>> in terms of Strings, but, if such a case does exist, it would be extremely
>> prudent to fix.
>>
>> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced by
>> other commands [1,2]. It would be good to double check all of the other
>> commands.
>
> I'm a bit lost. Any possible Java String can be rendered in UTF-8. So,
> if you are calling String.getBytes to turn a string into some bytes
> for some purpose, I think you need UTF-8.
>
> On the other hand, as Mike pointed out, new String(somebytes, "utf-8")
> will destroy data for some byte values that are not, in fact, UTF-8.
> By why would Accumulo ever need to string-ify some array of bytes of
> uncertain parentage?
>
>
>>
>> [1]
>> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
>> [2]
>> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
>>
>>
>> On 10/29/2012 8:27 PM, Michael Flester wrote:
>>>
>>> I agree with Benson entirely with one caveat. It seems to me that there
>>> might be two categories of things being discussed
>>>
>>>     1. User data (keys and values)
>>>     2. Ancillary things needed for operation of Accumulo (passwords).
>>>
>>> These could well be considered separately. Trying to do anything with
>>> keys and values other than treating them as bytes all of the time
>>> I find quite scary.
>>>
>>> And if this is only being done to satisfy pmd or findbugs, those tools
>>> can be convinced to modify their reporting about this issue.
>>>
>>

Re: Setting Charset in getBytes() call.

Posted by Benson Margulies <bi...@gmail.com>.

On Mon, Oct 29, 2012 at 8:46 PM, Josh Elser <jo...@gmail.com> wrote:
> +1 Mike.
>
> 1. It would be hard for me to believe Key/Value are ever handled internally
> in terms of Strings, but, if such a case does exist, it would be extremely
> prudent to fix.
>
> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced by
> other commands [1,2]. It would be good to double check all of the other
> commands.

I'm a bit lost. Any possible Java String can be rendered in UTF-8. So,
if you are calling String.getBytes to turn a string into some bytes
for some purpose, I think you need UTF-8.

On the other hand, as Mike pointed out, new String(somebytes, "utf-8")
will destroy data for some byte values that are not, in fact, UTF-8.
By why would Accumulo ever need to string-ify some array of bytes of
uncertain parentage?


>
> [1]
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> [2]
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
>
>
> On 10/29/2012 8:27 PM, Michael Flester wrote:
>>
>> I agree with Benson entirely with one caveat. It seems to me that there
>> might be two categories of things being discussed
>>
>>    1. User data (keys and values)
>>    2. Ancillary things needed for operation of Accumulo (passwords).
>>
>> These could well be considered separately. Trying to do anything with
>> keys and values other than treating them as bytes all of the time
>> I find quite scary.
>>
>> And if this is only being done to satisfy pmd or findbugs, those tools
>> can be convinced to modify their reporting about this issue.
>>
>

Re: Setting Charset in getBytes() call.

Posted by Josh Elser <jo...@gmail.com>.

+1 Mike.

1. It would be hard for me to believe Key/Value are ever handled 
internally in terms of Strings, but, if such a case does exist, it would 
be extremely prudent to fix.

2. FWIW, the Shell does use ISO-8859-1 as its charset which is 
referenced by other commands [1,2]. It would be good to double check all 
of the other commands.

[1] 
https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
[2] 
https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java

On 10/29/2012 8:27 PM, Michael Flester wrote:
> I agree with Benson entirely with one caveat. It seems to me that there
> might be two categories of things being discussed
>
>    1. User data (keys and values)
>    2. Ancillary things needed for operation of Accumulo (passwords).
>
> These could well be considered separately. Trying to do anything with
> keys and values other than treating them as bytes all of the time
> I find quite scary.
>
> And if this is only being done to satisfy pmd or findbugs, those tools
> can be convinced to modify their reporting about this issue.
>

Re: Setting Charset in getBytes() call.

Posted by Michael Flester <fl...@gmail.com>.

I agree with Benson entirely with one caveat. It seems to me that there
might be two categories of things being discussed

  1. User data (keys and values)
  2. Ancillary things needed for operation of Accumulo (passwords).

These could well be considered separately. Trying to do anything with
keys and values other than treating them as bytes all of the time
I find quite scary.

And if this is only being done to satisfy pmd or findbugs, those tools
can be convinced to modify their reporting about this issue.

Re: Setting Charset in getBytes() call.

Posted by David Medinets <da...@gmail.com>.

Anytime that I've encountered non-English character sets, the answer
has been to use UTF-8. I'm moving forward with that assumption since
it is safe change. If the group decides to use a different default
encoding, it will be trivial to build on the work that I've done
identifying getBytes() calls. I will post a list of files and my
methodology before a svn checkin.

On Mon, Oct 29, 2012 at 4:02 PM, Benson Margulies <bi...@gmail.com> wrote:
> On Mon, Oct 29, 2012 at 3:18 PM, John Vines <vi...@apache.org> wrote:
>> So perhaps we should have ISO-8859-1 as the standard. Mike- do you see any
>> reason to use something beside ISO-8859-1 for the encodings?
>
> I object and caution against *any* plan that involves transcoding from
> X to UTF-16 and back where when the data is not always going to be
> valid bytes of encoding X. The only clean solution here is to have an
> API entirely in terms of bytes, and either let the user do getBytes if
> they want to store string data, or provide additional API.
>
>
>
>>
>> John
>>
>> On Mon, Oct 29, 2012 at 3:14 PM, Michael Flester <fl...@gmail.com> wrote:
>>
>>> > UTF-8 should always be present (according to the JLS), and as a
>>> multi-byte
>>> > format should be able to encode any character that you would need to.
>>> >
>>>
>>> UTF-8 cannot encode arbitrary data. All data that we store in accumulo
>>> is not characters. A safe encoding to use as a pass through when you
>>> don't know if you are dealing with characters is ISO-8859-1 since we know
>>> that we can make the round trip from bytes to string to bytes without loss.
>>>

Re: Setting Charset in getBytes() call.

Posted by Benson Margulies <bi...@gmail.com>.

On Mon, Oct 29, 2012 at 3:18 PM, John Vines <vi...@apache.org> wrote:
> So perhaps we should have ISO-8859-1 as the standard. Mike- do you see any
> reason to use something beside ISO-8859-1 for the encodings?

I object and caution against *any* plan that involves transcoding from
X to UTF-16 and back where when the data is not always going to be
valid bytes of encoding X. The only clean solution here is to have an
API entirely in terms of bytes, and either let the user do getBytes if
they want to store string data, or provide additional API.

>
> John
>
> On Mon, Oct 29, 2012 at 3:14 PM, Michael Flester <fl...@gmail.com> wrote:
>
>> > UTF-8 should always be present (according to the JLS), and as a
>> multi-byte
>> > format should be able to encode any character that you would need to.
>> >
>>
>> UTF-8 cannot encode arbitrary data. All data that we store in accumulo
>> is not characters. A safe encoding to use as a pass through when you
>> don't know if you are dealing with characters is ISO-8859-1 since we know
>> that we can make the round trip from bytes to string to bytes without loss.
>>

Re: Setting Charset in getBytes() call.

Posted by John Vines <vi...@apache.org>.

So perhaps we should have ISO-8859-1 as the standard. Mike- do you see any
reason to use something beside ISO-8859-1 for the encodings?

John

On Mon, Oct 29, 2012 at 3:14 PM, Michael Flester <fl...@gmail.com> wrote:

> > UTF-8 should always be present (according to the JLS), and as a
> multi-byte
> > format should be able to encode any character that you would need to.
> >
>
> UTF-8 cannot encode arbitrary data. All data that we store in accumulo
> is not characters. A safe encoding to use as a pass through when you
> don't know if you are dealing with characters is ISO-8859-1 since we know
> that we can make the round trip from bytes to string to bytes without loss.
>

Re: Setting Charset in getBytes() call.

Posted by Michael Flester <fl...@gmail.com>.

> UTF-8 should always be present (according to the JLS), and as a multi-byte
> format should be able to encode any character that you would need to.
>

UTF-8 cannot encode arbitrary data. All data that we store in accumulo
is not characters. A safe encoding to use as a pass through when you
don't know if you are dealing with characters is ISO-8859-1 since we know
that we can make the round trip from bytes to string to bytes without loss.

Re: Setting Charset in getBytes() call.

Posted by Mike Drob <md...@mdrob.com>.

One specific use case is when creating a new connection, the password is
passed as a byte[], when I expect most sane applications will treat it as a
String (either via reading it from a file, or reading it from a terminal
input). If somebody creates the password with a different platform encoding
than what the programmer expects, then it will cause a lock out that is
very difficult to debug.

On topic to the original question, if anybody is brave enough to use Java
7, then there are predefined constants in the JDK -
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html

UTF-8 should always be present (according to the JLS), and as a multi-byte
format should be able to encode any character that you would need to. I've
had this conversation with Keith before, so hopefully he can weigh in on
this.

Mike



On Mon, Oct 29, 2012 at 12:57 PM, Josh Elser <jo...@gmail.com> wrote:

> Benson, perhaps "contrived" would have been better than "hypothetical" :).
> That being said, I also hadn't thought about other JVM implementations.
>
> I wonder if leaving a commented note in the accumulo-env.sh script for
> alternative namings for the "file.encoding" name and the JVM it applies to
> would be sufficient?
>
> David, can you give some sort of feel for the usages of the getBytes()
> calls? Since most of the API deals with things in terms of Text and byte[]
> (Key and Value decomposed), are most of the usages configuration/user-input
> based as your initial snippet from InputFormatBase showed?
>
>
> On 10/29/2012 12:42 PM, John Vines wrote:
>
>> Are there any experts when it comes to character encodings? First of all,
>> I
>> would like to make sure there are no sacrifices being made by forcing
>> UTF-8.
>>
>>  From there, if I think JVM properties is the way to go. Should there be
>> ANY
>> sort of shortfall with UTF-8, we should allow users to switch the encoding
>> to the type of their pleasure. We can tweak the scripts to set the jvm
>> property but still allow users to override should they need it in their
>> setup. This allows us to not only avoid a massive code change, it also
>> makes it easier for users to switch to an encoding should they have a need
>> to.
>>
>> John
>>
>> On Mon, Oct 29, 2012 at 12:24 PM, Benson Margulies <bimargulies@gmail.com
>> >wrote:
>>
>>  On Mon, Oct 29, 2012 at 12:21 PM, Josh Elser <jo...@gmail.com>
>>> wrote:
>>>
>>>> David, I beg to differ.
>>>>
>>>> Setting it via the JVM property is a single change to make, whereas if
>>>>
>>> you
>>>
>>>> change every single usage of getBytes(), you now forced the next person
>>>>
>>> to
>>>
>>>> branch the code, change everything to UTF16 (hypothetical use case) and
>>>> continue a diverged codebase forever.
>>>>
>>>
>>> Typically, the reason(s) that people don't take this approach are:
>>>
>>> a: a fear that other JVMs don't have this parameter, or don't have it
>>> under the same name.
>>> b: a desire to read or write files for uses in 'the platform encoding'
>>> whatever it is, in addition to whatever needs to be done in UTF-8.
>>>
>>> I'd be very surprised if Accumulo ever decided to do this sort of
>>> thing in UTF-16.
>>>
>>>
>>>
>>>> I would say that the reason that such a JVM property exists is to
>>>>
>>> alleviate
>>>
>>>> you from having to make these code changes in the first place.
>>>>
>>>> On 10/29/2012 12:00 PM, David Medinets wrote:
>>>>
>>>>>
>>>>> I like the idea of making the change explicit in the source code.
>>>>> Setting the encoding in the jvm property would be easier but not as
>>>>> explicit. I have a few dozen of the files changed. Today I have free
>>>>> time since Hurricane Sandy has closed offices.
>>>>>
>>>>> On Mon, Oct 29, 2012 at 11:39 AM, William Slacum
>>>>> <wilhelm.von.cloud@accumulo.**net <wi...@accumulo.net>>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Isn't it easier to just set the JVM property `file.encoding`?
>>>>>>
>>>>>> On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ek...@gmail.com>
>>>>>>
>>>>> wrote:
>>>
>>>>
>>>>>>  If you use a private static field in each class for the charset, it
>>>>>>>
>>>>>> will
>>>
>>>> basically be a singleton because charsets are cached in char
>>>>>>> set.forname.
>>>>>>> IMHO this is a somewhat cleaner approach than having lots of static
>>>>>>> imports
>>>>>>> to utility classes with lots of constants in them.
>>>>>>> On Oct 28, 2012 5:50 PM, "David Medinets" <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>  https://issues.apache.org/**jira/browse/ACCUMULO-241?**
>>> focusedCommentId=13449680&**page=com.atlassian.jira.**
>>> plugin.system.issuetabpanels:**comment-tabpanel#comment-**13449680<https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680>
>>>
>>>>
>>>>>>>>
>>>>>>>> In this comment, John mentioned that all getBytes() method calls
>>>>>>>> should be changed to use UTF8. There are about 1,800 getBytes()
>>>>>>>> calls
>>>>>>>> and not all of them involve String objects. I am working on ways to
>>>>>>>> identify a subset of these calls to change.
>>>>>>>>
>>>>>>>> I have created https://issues.apache.org/**jira/browse/ACCUMULO-836<https://issues.apache.org/jira/browse/ACCUMULO-836>to
>>>>>>>> track this issue.
>>>>>>>>
>>>>>>>> Should we create one static Charset object?
>>>>>>>>
>>>>>>>>     Class AccumuloDefaultCharset {
>>>>>>>>       public static Charset UTF8 = Charset.forName("UTF8");
>>>>>>>>     }
>>>>>>>>
>>>>>>>> Should we use a static constant?
>>>>>>>>
>>>>>>>>     public static String UTF8 = "UTF8";
>>>>>>>>
>>>>>>>> I have found one instance of getBytes() in InputFormatBase:
>>>>>>>>
>>>>>>>>     protected static byte[] getPassword(Configuration conf) {
>>>>>>>>       return Base64.decodeBase64(conf.get(**PASSWORD,
>>>>>>>> "").getBytes());
>>>>>>>>     }
>>>>>>>>
>>>>>>>> Are there any reasons why I can't start specifying the charset? Is
>>>>>>>> UTF8 the right Charset to use? I am not an expert in non-English
>>>>>>>> charsets, so guidance would be welcome.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>
>>

Re: Setting Charset in getBytes() call.

Posted by Josh Elser <jo...@gmail.com>.

Benson, perhaps "contrived" would have been better than "hypothetical" 
:). That being said, I also hadn't thought about other JVM implementations.

I wonder if leaving a commented note in the accumulo-env.sh script for 
alternative namings for the "file.encoding" name and the JVM it applies 
to would be sufficient?

David, can you give some sort of feel for the usages of the getBytes() 
calls? Since most of the API deals with things in terms of Text and 
byte[] (Key and Value decomposed), are most of the usages 
configuration/user-input based as your initial snippet from 
InputFormatBase showed?

On 10/29/2012 12:42 PM, John Vines wrote:
> Are there any experts when it comes to character encodings? First of all, I
> would like to make sure there are no sacrifices being made by forcing UTF-8.
>
>  From there, if I think JVM properties is the way to go. Should there be ANY
> sort of shortfall with UTF-8, we should allow users to switch the encoding
> to the type of their pleasure. We can tweak the scripts to set the jvm
> property but still allow users to override should they need it in their
> setup. This allows us to not only avoid a massive code change, it also
> makes it easier for users to switch to an encoding should they have a need
> to.
>
> John
>
> On Mon, Oct 29, 2012 at 12:24 PM, Benson Margulies <bi...@gmail.com>wrote:
>
>> On Mon, Oct 29, 2012 at 12:21 PM, Josh Elser <jo...@gmail.com> wrote:
>>> David, I beg to differ.
>>>
>>> Setting it via the JVM property is a single change to make, whereas if
>> you
>>> change every single usage of getBytes(), you now forced the next person
>> to
>>> branch the code, change everything to UTF16 (hypothetical use case) and
>>> continue a diverged codebase forever.
>>
>> Typically, the reason(s) that people don't take this approach are:
>>
>> a: a fear that other JVMs don't have this parameter, or don't have it
>> under the same name.
>> b: a desire to read or write files for uses in 'the platform encoding'
>> whatever it is, in addition to whatever needs to be done in UTF-8.
>>
>> I'd be very surprised if Accumulo ever decided to do this sort of
>> thing in UTF-16.
>>
>>
>>>
>>> I would say that the reason that such a JVM property exists is to
>> alleviate
>>> you from having to make these code changes in the first place.
>>>
>>> On 10/29/2012 12:00 PM, David Medinets wrote:
>>>>
>>>> I like the idea of making the change explicit in the source code.
>>>> Setting the encoding in the jvm property would be easier but not as
>>>> explicit. I have a few dozen of the files changed. Today I have free
>>>> time since Hurricane Sandy has closed offices.
>>>>
>>>> On Mon, Oct 29, 2012 at 11:39 AM, William Slacum
>>>> <wi...@accumulo.net> wrote:
>>>>>
>>>>> Isn't it easier to just set the JVM property `file.encoding`?
>>>>>
>>>>> On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ek...@gmail.com>
>> wrote:
>>>>>
>>>>>> If you use a private static field in each class for the charset, it
>> will
>>>>>> basically be a singleton because charsets are cached in char
>>>>>> set.forname.
>>>>>> IMHO this is a somewhat cleaner approach than having lots of static
>>>>>> imports
>>>>>> to utility classes with lots of constants in them.
>>>>>> On Oct 28, 2012 5:50 PM, "David Medinets" <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
>>>>>>>
>>>>>>>
>>>>>>> In this comment, John mentioned that all getBytes() method calls
>>>>>>> should be changed to use UTF8. There are about 1,800 getBytes() calls
>>>>>>> and not all of them involve String objects. I am working on ways to
>>>>>>> identify a subset of these calls to change.
>>>>>>>
>>>>>>> I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
>>>>>>> track this issue.
>>>>>>>
>>>>>>> Should we create one static Charset object?
>>>>>>>
>>>>>>>     Class AccumuloDefaultCharset {
>>>>>>>       public static Charset UTF8 = Charset.forName("UTF8");
>>>>>>>     }
>>>>>>>
>>>>>>> Should we use a static constant?
>>>>>>>
>>>>>>>     public static String UTF8 = "UTF8";
>>>>>>>
>>>>>>> I have found one instance of getBytes() in InputFormatBase:
>>>>>>>
>>>>>>>     protected static byte[] getPassword(Configuration conf) {
>>>>>>>       return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
>>>>>>>     }
>>>>>>>
>>>>>>> Are there any reasons why I can't start specifying the charset? Is
>>>>>>> UTF8 the right Charset to use? I am not an expert in non-English
>>>>>>> charsets, so guidance would be welcome.
>>>>>>>
>>>>>>
>>>
>>
>

Re: Setting Charset in getBytes() call.

Posted by John Vines <vi...@apache.org>.

Are there any experts when it comes to character encodings? First of all, I
would like to make sure there are no sacrifices being made by forcing UTF-8.

>From there, if I think JVM properties is the way to go. Should there be ANY
sort of shortfall with UTF-8, we should allow users to switch the encoding
to the type of their pleasure. We can tweak the scripts to set the jvm
property but still allow users to override should they need it in their
setup. This allows us to not only avoid a massive code change, it also
makes it easier for users to switch to an encoding should they have a need
to.

John

On Mon, Oct 29, 2012 at 12:24 PM, Benson Margulies <bi...@gmail.com>wrote:

> On Mon, Oct 29, 2012 at 12:21 PM, Josh Elser <jo...@gmail.com> wrote:
> > David, I beg to differ.
> >
> > Setting it via the JVM property is a single change to make, whereas if
> you
> > change every single usage of getBytes(), you now forced the next person
> to
> > branch the code, change everything to UTF16 (hypothetical use case) and
> > continue a diverged codebase forever.
>
> Typically, the reason(s) that people don't take this approach are:
>
> a: a fear that other JVMs don't have this parameter, or don't have it
> under the same name.
> b: a desire to read or write files for uses in 'the platform encoding'
> whatever it is, in addition to whatever needs to be done in UTF-8.
>
> I'd be very surprised if Accumulo ever decided to do this sort of
> thing in UTF-16.
>
>
> >
> > I would say that the reason that such a JVM property exists is to
> alleviate
> > you from having to make these code changes in the first place.
> >
> > On 10/29/2012 12:00 PM, David Medinets wrote:
> >>
> >> I like the idea of making the change explicit in the source code.
> >> Setting the encoding in the jvm property would be easier but not as
> >> explicit. I have a few dozen of the files changed. Today I have free
> >> time since Hurricane Sandy has closed offices.
> >>
> >> On Mon, Oct 29, 2012 at 11:39 AM, William Slacum
> >> <wi...@accumulo.net> wrote:
> >>>
> >>> Isn't it easier to just set the JVM property `file.encoding`?
> >>>
> >>> On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ek...@gmail.com>
> wrote:
> >>>
> >>>> If you use a private static field in each class for the charset, it
> will
> >>>> basically be a singleton because charsets are cached in char
> >>>> set.forname.
> >>>> IMHO this is a somewhat cleaner approach than having lots of static
> >>>> imports
> >>>> to utility classes with lots of constants in them.
> >>>> On Oct 28, 2012 5:50 PM, "David Medinets" <da...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
> >>>>>
> >>>>>
> >>>>> In this comment, John mentioned that all getBytes() method calls
> >>>>> should be changed to use UTF8. There are about 1,800 getBytes() calls
> >>>>> and not all of them involve String objects. I am working on ways to
> >>>>> identify a subset of these calls to change.
> >>>>>
> >>>>> I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
> >>>>> track this issue.
> >>>>>
> >>>>> Should we create one static Charset object?
> >>>>>
> >>>>>    Class AccumuloDefaultCharset {
> >>>>>      public static Charset UTF8 = Charset.forName("UTF8");
> >>>>>    }
> >>>>>
> >>>>> Should we use a static constant?
> >>>>>
> >>>>>    public static String UTF8 = "UTF8";
> >>>>>
> >>>>> I have found one instance of getBytes() in InputFormatBase:
> >>>>>
> >>>>>    protected static byte[] getPassword(Configuration conf) {
> >>>>>      return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
> >>>>>    }
> >>>>>
> >>>>> Are there any reasons why I can't start specifying the charset? Is
> >>>>> UTF8 the right Charset to use? I am not an expert in non-English
> >>>>> charsets, so guidance would be welcome.
> >>>>>
> >>>>
> >
>

Re: Setting Charset in getBytes() call.

Posted by Benson Margulies <bi...@gmail.com>.

On Mon, Oct 29, 2012 at 12:21 PM, Josh Elser <jo...@gmail.com> wrote:
> David, I beg to differ.
>
> Setting it via the JVM property is a single change to make, whereas if you
> change every single usage of getBytes(), you now forced the next person to
> branch the code, change everything to UTF16 (hypothetical use case) and
> continue a diverged codebase forever.

Typically, the reason(s) that people don't take this approach are:

a: a fear that other JVMs don't have this parameter, or don't have it
under the same name.
b: a desire to read or write files for uses in 'the platform encoding'
whatever it is, in addition to whatever needs to be done in UTF-8.

I'd be very surprised if Accumulo ever decided to do this sort of
thing in UTF-16.


>
> I would say that the reason that such a JVM property exists is to alleviate
> you from having to make these code changes in the first place.
>
> On 10/29/2012 12:00 PM, David Medinets wrote:
>>
>> I like the idea of making the change explicit in the source code.
>> Setting the encoding in the jvm property would be easier but not as
>> explicit. I have a few dozen of the files changed. Today I have free
>> time since Hurricane Sandy has closed offices.
>>
>> On Mon, Oct 29, 2012 at 11:39 AM, William Slacum
>> <wi...@accumulo.net> wrote:
>>>
>>> Isn't it easier to just set the JVM property `file.encoding`?
>>>
>>> On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ek...@gmail.com> wrote:
>>>
>>>> If you use a private static field in each class for the charset, it will
>>>> basically be a singleton because charsets are cached in char
>>>> set.forname.
>>>> IMHO this is a somewhat cleaner approach than having lots of static
>>>> imports
>>>> to utility classes with lots of constants in them.
>>>> On Oct 28, 2012 5:50 PM, "David Medinets" <da...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>
>>>> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
>>>>>
>>>>>
>>>>> In this comment, John mentioned that all getBytes() method calls
>>>>> should be changed to use UTF8. There are about 1,800 getBytes() calls
>>>>> and not all of them involve String objects. I am working on ways to
>>>>> identify a subset of these calls to change.
>>>>>
>>>>> I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
>>>>> track this issue.
>>>>>
>>>>> Should we create one static Charset object?
>>>>>
>>>>>    Class AccumuloDefaultCharset {
>>>>>      public static Charset UTF8 = Charset.forName("UTF8");
>>>>>    }
>>>>>
>>>>> Should we use a static constant?
>>>>>
>>>>>    public static String UTF8 = "UTF8";
>>>>>
>>>>> I have found one instance of getBytes() in InputFormatBase:
>>>>>
>>>>>    protected static byte[] getPassword(Configuration conf) {
>>>>>      return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
>>>>>    }
>>>>>
>>>>> Are there any reasons why I can't start specifying the charset? Is
>>>>> UTF8 the right Charset to use? I am not an expert in non-English
>>>>> charsets, so guidance would be welcome.
>>>>>
>>>>
>

Re: Setting Charset in getBytes() call.

Posted by Josh Elser <jo...@gmail.com>.

David, I beg to differ.

Setting it via the JVM property is a single change to make, whereas if 
you change every single usage of getBytes(), you now forced the next 
person to branch the code, change everything to UTF16 (hypothetical use 
case) and continue a diverged codebase forever.

I would say that the reason that such a JVM property exists is to 
alleviate you from having to make these code changes in the first place.

On 10/29/2012 12:00 PM, David Medinets wrote:
> I like the idea of making the change explicit in the source code.
> Setting the encoding in the jvm property would be easier but not as
> explicit. I have a few dozen of the files changed. Today I have free
> time since Hurricane Sandy has closed offices.
>
> On Mon, Oct 29, 2012 at 11:39 AM, William Slacum
> <wi...@accumulo.net> wrote:
>> Isn't it easier to just set the JVM property `file.encoding`?
>>
>> On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ek...@gmail.com> wrote:
>>
>>> If you use a private static field in each class for the charset, it will
>>> basically be a singleton because charsets are cached in char set.forname.
>>> IMHO this is a somewhat cleaner approach than having lots of static imports
>>> to utility classes with lots of constants in them.
>>> On Oct 28, 2012 5:50 PM, "David Medinets" <da...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
>>>>
>>>> In this comment, John mentioned that all getBytes() method calls
>>>> should be changed to use UTF8. There are about 1,800 getBytes() calls
>>>> and not all of them involve String objects. I am working on ways to
>>>> identify a subset of these calls to change.
>>>>
>>>> I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
>>>> track this issue.
>>>>
>>>> Should we create one static Charset object?
>>>>
>>>>    Class AccumuloDefaultCharset {
>>>>      public static Charset UTF8 = Charset.forName("UTF8");
>>>>    }
>>>>
>>>> Should we use a static constant?
>>>>
>>>>    public static String UTF8 = "UTF8";
>>>>
>>>> I have found one instance of getBytes() in InputFormatBase:
>>>>
>>>>    protected static byte[] getPassword(Configuration conf) {
>>>>      return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
>>>>    }
>>>>
>>>> Are there any reasons why I can't start specifying the charset? Is
>>>> UTF8 the right Charset to use? I am not an expert in non-English
>>>> charsets, so guidance would be welcome.
>>>>
>>>

Re: Setting Charset in getBytes() call.

Posted by David Medinets <da...@gmail.com>.

I like the idea of making the change explicit in the source code.
Setting the encoding in the jvm property would be easier but not as
explicit. I have a few dozen of the files changed. Today I have free
time since Hurricane Sandy has closed offices.

On Mon, Oct 29, 2012 at 11:39 AM, William Slacum
<wi...@accumulo.net> wrote:
> Isn't it easier to just set the JVM property `file.encoding`?
>
> On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ek...@gmail.com> wrote:
>
>> If you use a private static field in each class for the charset, it will
>> basically be a singleton because charsets are cached in char set.forname.
>> IMHO this is a somewhat cleaner approach than having lots of static imports
>> to utility classes with lots of constants in them.
>> On Oct 28, 2012 5:50 PM, "David Medinets" <da...@gmail.com>
>> wrote:
>>
>> >
>> >
>> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
>> >
>> > In this comment, John mentioned that all getBytes() method calls
>> > should be changed to use UTF8. There are about 1,800 getBytes() calls
>> > and not all of them involve String objects. I am working on ways to
>> > identify a subset of these calls to change.
>> >
>> > I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
>> > track this issue.
>> >
>> > Should we create one static Charset object?
>> >
>> >   Class AccumuloDefaultCharset {
>> >     public static Charset UTF8 = Charset.forName("UTF8");
>> >   }
>> >
>> > Should we use a static constant?
>> >
>> >   public static String UTF8 = "UTF8";
>> >
>> > I have found one instance of getBytes() in InputFormatBase:
>> >
>> >   protected static byte[] getPassword(Configuration conf) {
>> >     return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
>> >   }
>> >
>> > Are there any reasons why I can't start specifying the charset? Is
>> > UTF8 the right Charset to use? I am not an expert in non-English
>> > charsets, so guidance would be welcome.
>> >
>>

Re: Setting Charset in getBytes() call.

Posted by William Slacum <wi...@accumulo.net>.

Isn't it easier to just set the JVM property `file.encoding`?

On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ek...@gmail.com> wrote:

> If you use a private static field in each class for the charset, it will
> basically be a singleton because charsets are cached in char set.forname.
> IMHO this is a somewhat cleaner approach than having lots of static imports
> to utility classes with lots of constants in them.
> On Oct 28, 2012 5:50 PM, "David Medinets" <da...@gmail.com>
> wrote:
>
> >
> >
> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
> >
> > In this comment, John mentioned that all getBytes() method calls
> > should be changed to use UTF8. There are about 1,800 getBytes() calls
> > and not all of them involve String objects. I am working on ways to
> > identify a subset of these calls to change.
> >
> > I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
> > track this issue.
> >
> > Should we create one static Charset object?
> >
> >   Class AccumuloDefaultCharset {
> >     public static Charset UTF8 = Charset.forName("UTF8");
> >   }
> >
> > Should we use a static constant?
> >
> >   public static String UTF8 = "UTF8";
> >
> > I have found one instance of getBytes() in InputFormatBase:
> >
> >   protected static byte[] getPassword(Configuration conf) {
> >     return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
> >   }
> >
> > Are there any reasons why I can't start specifying the charset? Is
> > UTF8 the right Charset to use? I am not an expert in non-English
> > charsets, so guidance would be welcome.
> >
>

Re: Setting Charset in getBytes() call.

Posted by Ed Kohlwey <ek...@gmail.com>.

If you use a private static field in each class for the charset, it will
basically be a singleton because charsets are cached in char set.forname.
IMHO this is a somewhat cleaner approach than having lots of static imports
to utility classes with lots of constants in them.
On Oct 28, 2012 5:50 PM, "David Medinets" <da...@gmail.com> wrote:

>
> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
>
> In this comment, John mentioned that all getBytes() method calls
> should be changed to use UTF8. There are about 1,800 getBytes() calls
> and not all of them involve String objects. I am working on ways to
> identify a subset of these calls to change.
>
> I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
> track this issue.
>
> Should we create one static Charset object?
>
>   Class AccumuloDefaultCharset {
>     public static Charset UTF8 = Charset.forName("UTF8");
>   }
>
> Should we use a static constant?
>
>   public static String UTF8 = "UTF8";
>
> I have found one instance of getBytes() in InputFormatBase:
>
>   protected static byte[] getPassword(Configuration conf) {
>     return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
>   }
>
> Are there any reasons why I can't start specifying the charset? Is
> UTF8 the right Charset to use? I am not an expert in non-English
> charsets, so guidance would be welcome.
>