Posted to dev@accumulo.apache.org by Drew Farris <dr...@apache.org> on 2013/05/06 00:49:46 UTC

Shell Charset?

In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
getEndRow use the following snippet to read their arguments:

new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));

Here, Shell.CHARSET is set to ISO-8859-1

This seems to mean that if I use UTF-8 characters (unescaped) in the
shell to set my begin or end row, I will not get what I expect, because
the conversion from String to bytes is performed using the wrong
character set.

For example, in the following snippet, testISO fails while testUTF succeeds
(when the source file is encoded as UTF-8):


  @Test
  public void testISO() throws Exception {
    String s = "本条目是介紹";
    String charset = "ISO-8859-1";
    Text t = new Text(s.getBytes(charset));
    Assert.assertEquals(s, t.toString());
  }

  @Test
  public void testUTF() throws Exception {
    String s = "本条目是介紹";
    String charset = "UTF-8";
    Text t = new Text(s.getBytes(charset));
    Assert.assertEquals(s, t.toString());
  }


Possibly this should be locale-dependent behavior? Or perhaps I'm
missing the fact that the Shell is not supposed to support UTF-8 characters
in start and end ranges, and users must escape their strings appropriately
(which would be a bit of a pain).


- Drew

Re: Shell Charset?

Posted by John Vines <vi...@apache.org>.
Sounds like we should grep through the codebase and make sure the only
charset we're using is UTF-8...


On Sun, May 5, 2013 at 8:08 PM, Christopher <ct...@apache.org> wrote:

> The shell should accept a java "String" from the console (leaving
> the job of converting input bytes to a java String argument to the
> locale-dependent console), and should only translate them to UTF-8
> when it sends it to Accumulo, I think.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sun, May 5, 2013 at 6:49 PM, Drew Farris <dr...@apache.org> wrote:
> > In o.a.a.core.uti.shell.commands.OptUtil, I notice that getStartRow and
> > getEndRow, use the following snippet to read their arguments:
> >
> > new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
> >
> > Here, Shell.CHARSET is set to ISO-8859-1
> >
> > This seems to mean that if I use UTF-8 characters (unescaped) from the
> > shell to set my begin or end row, that I will not get what I expect
> because
> > the conversion from String to bytes would be performed using the
> incorrect
> > character set.
> >
> > For example, in the following snippet, testIso fails while testUTF
> succeeds
> > (when the encoding of the source file is UTF-8):
> >
> >
> >   @Test
> >
> >   public void testISO() throws Exception {
> >
> >     String s = "本条目是介紹";
> >
> >     String charset = "ISO-8859-1";
> >
> >     Text t = new Text(s.getBytes(charset));
> >
> >     Assert.assertEquals(s, t.toString());
> >
> >   }
> >
> >
> >   @Test
> >
> >   public void testUTF() throws Exception {
> >
> >     String s = "本条目是介紹";
> >
> >     String charset = "UTF-8";
> >
> >     Text t = new Text(s.getBytes(charset));
> >
> >     Assert.assertEquals(s, t.toString());
> >
> >   }
> >
> >
> > Possibly this should be locale dependent behavior? Also, perhaps I'm
> > missing the fact that the Shell is not supposed to support UTF-8
> characters
> > in start and end ranges, and users must escape their strings
> appropriately.
> > (Which would be a bit of a pain).
> >
> >
> > - Drew
>

Re: Shell Charset?

Posted by Christopher <ct...@apache.org>.
The shell should accept a java "String" from the console (leaving
the job of converting input bytes to a java String argument to the
locale-dependent console), and should only translate them to UTF-8
when it sends it to Accumulo, I think.
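A minimal sketch of what that boundary could look like (hypothetical code, not the actual OptUtil; the class and method names are invented for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundary {
    // Hypothetical sketch: the console has already decoded the user's input
    // into a java String using its own locale-dependent charset, so the only
    // charset decision left is at the Accumulo boundary, and it is always UTF-8.
    static byte[] toRowBytes(String rowFromConsole) {
        return rowFromConsole.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] row = toRowBytes("本条目是介紹");
        // Decoding the bytes back with UTF-8 recovers the original string, so a
        // Text built from them (Text stores and decodes UTF-8) round-trips.
        System.out.println(new String(row, StandardCharsets.UTF_8));
    }
}
```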

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Sun, May 5, 2013 at 6:49 PM, Drew Farris <dr...@apache.org> wrote:
> In o.a.a.core.uti.shell.commands.OptUtil, I notice that getStartRow and
> getEndRow, use the following snippet to read their arguments:
>
> new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
>
> Here, Shell.CHARSET is set to ISO-8859-1
>
> This seems to mean that if I use UTF-8 characters (unescaped) from the
> shell to set my begin or end row, that I will not get what I expect because
> the conversion from String to bytes would be performed using the incorrect
> character set.
>
> For example, in the following snippet, testIso fails while testUTF succeeds
> (when the encoding of the source file is UTF-8):
>
>
>   @Test
>
>   public void testISO() throws Exception {
>
>     String s = "本条目是介紹";
>
>     String charset = "ISO-8859-1";
>
>     Text t = new Text(s.getBytes(charset));
>
>     Assert.assertEquals(s, t.toString());
>
>   }
>
>
>   @Test
>
>   public void testUTF() throws Exception {
>
>     String s = "本条目是介紹";
>
>     String charset = "UTF-8";
>
>     Text t = new Text(s.getBytes(charset));
>
>     Assert.assertEquals(s, t.toString());
>
>   }
>
>
> Possibly this should be locale dependent behavior? Also, perhaps I'm
> missing the fact that the Shell is not supposed to support UTF-8 characters
> in start and end ranges, and users must escape their strings appropriately.
> (Which would be a bit of a pain).
>
>
> - Drew

Re: Shell Charset?

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, May 6, 2013 at 2:49 PM, Josh Elser <jo...@gmail.com> wrote:

> Would a better long-term solution be to just deal with it in a new shell
> that actually supports all sorts of constructs outside of the current shell
> commands?
>
> I'm thinking of Python where you have the ability to specify things like
> u'\0000'. The proxy would certainly drop the barrier of doing something
> like this.
>
> Would that be overkill to work towards in 1.6? Does this merit fixing
> sooner?


There is ACCUMULO-1045.


>
>
> On 5/6/13 2:09 PM, Keith Turner wrote:
>
>> On Sun, May 5, 2013 at 6:49 PM, Drew Farris <dr...@apache.org> wrote:
>>
>>> In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
>>> getEndRow use the following snippet to read their arguments:
>>>
>>> new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
>>>
>>> Here, Shell.CHARSET is set to ISO-8859-1
>>>
>>> This seems to mean that if I use UTF-8 characters (unescaped) from the
>>> shell to set my begin or end row, that I will not get what I expect
>>> because
>>> the conversion from String to bytes would be performed using the
>>> incorrect
>>> character set.
>>>
>>> For example, in the following snippet, testIso fails while testUTF
>>> succeeds
>>> (when the encoding of the source file is UTF-8):
>>>
>>>
>>>    @Test
>>>
>>>    public void testISO() throws Exception {
>>>
>>>      String s = "本条目是介紹";
>>>
>>>      String charset = "ISO-8859-1";
>>>
>>>      Text t = new Text(s.getBytes(charset));
>>>
>>>      Assert.assertEquals(s, t.toString());
>>>
>>>    }
>>>
>>>
>>>    @Test
>>>
>>>    public void testUTF() throws Exception {
>>>
>>>      String s = "本条目是介紹";
>>>
>>>      String charset = "UTF-8";
>>>
>>>      Text t = new Text(s.getBytes(charset));
>>>
>>>      Assert.assertEquals(s, t.toString());
>>>
>>>    }
>>>
>>>
>>> Possibly this should be locale dependent behavior? Also, perhaps I'm
>>> missing the fact that the Shell is not supposed to support UTF-8
>>> characters
>>> in start and end ranges, and users must escape their strings
>>> appropriately.
>>> (Which would be a bit of a pain).
>>>
>> I think the way the shell is written, it pushes binary data (that may not
>> be UTF-8) through strings.  I think at some point the \xNN escape codes are
>> converted to binary and this data is pushed back into a String.
>> ISO-8859-1 ensures this works.  Ideally the shell would not do this.
>>
>>
>>
>>> - Drew
>>>
>>>
>

Re: Shell Charset?

Posted by Josh Elser <jo...@gmail.com>.
Would a better long-term solution be to just deal with it in a new shell 
that actually supports all sorts of constructs outside of the current 
shell commands?

I'm thinking of Python where you have the ability to specify things like 
u'\0000'. The proxy would certainly drop the barrier of doing something 
like this.

Would that be overkill to work towards in 1.6? Does this merit fixing 
sooner?

On 5/6/13 2:09 PM, Keith Turner wrote:
> On Sun, May 5, 2013 at 6:49 PM, Drew Farris <dr...@apache.org> wrote:
>
>> In o.a.a.core.uti.shell.commands.OptUtil, I notice that getStartRow and
>> getEndRow, use the following snippet to read their arguments:
>>
>> new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
>>
>> Here, Shell.CHARSET is set to ISO-8859-1
>>
>> This seems to mean that if I use UTF-8 characters (unescaped) from the
>> shell to set my begin or end row, that I will not get what I expect because
>> the conversion from String to bytes would be performed using the incorrect
>> character set.
>>
>> For example, in the following snippet, testIso fails while testUTF succeeds
>> (when the encoding of the source file is UTF-8):
>>
>>
>>    @Test
>>
>>    public void testISO() throws Exception {
>>
>>      String s = "本条目是介紹";
>>
>>      String charset = "ISO-8859-1";
>>
>>      Text t = new Text(s.getBytes(charset));
>>
>>      Assert.assertEquals(s, t.toString());
>>
>>    }
>>
>>
>>    @Test
>>
>>    public void testUTF() throws Exception {
>>
>>      String s = "本条目是介紹";
>>
>>      String charset = "UTF-8";
>>
>>      Text t = new Text(s.getBytes(charset));
>>
>>      Assert.assertEquals(s, t.toString());
>>
>>    }
>>
>>
>> Possibly this should be locale dependent behavior? Also, perhaps I'm
>> missing the fact that the Shell is not supposed to support UTF-8 characters
>> in start and end ranges, and users must escape their strings appropriately.
>> (Which would be a bit of a pain).
>>
> I think the way the shell is written, it pushes binary data (that may not
> be UTF-8) through strings.  I think at some point the \xNN escape codes are
> converted to binary and this data is pushed back into a String.
> ISO-8859-1 ensures this works.  Ideally the shell would not do this.
>
>
>>
>> - Drew
>>


Re: Shell Charset?

Posted by Keith Turner <ke...@deenlo.com>.
On Sun, May 5, 2013 at 6:49 PM, Drew Farris <dr...@apache.org> wrote:

> In o.a.a.core.uti.shell.commands.OptUtil, I notice that getStartRow and
> getEndRow, use the following snippet to read their arguments:
>
> new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
>
> Here, Shell.CHARSET is set to ISO-8859-1
>
> This seems to mean that if I use UTF-8 characters (unescaped) from the
> shell to set my begin or end row, that I will not get what I expect because
> the conversion from String to bytes would be performed using the incorrect
> character set.
>
> For example, in the following snippet, testIso fails while testUTF succeeds
> (when the encoding of the source file is UTF-8):
>
>
>   @Test
>
>   public void testISO() throws Exception {
>
>     String s = "本条目是介紹";
>
>     String charset = "ISO-8859-1";
>
>     Text t = new Text(s.getBytes(charset));
>
>     Assert.assertEquals(s, t.toString());
>
>   }
>
>
>   @Test
>
>   public void testUTF() throws Exception {
>
>     String s = "本条目是介紹";
>
>     String charset = "UTF-8";
>
>     Text t = new Text(s.getBytes(charset));
>
>     Assert.assertEquals(s, t.toString());
>
>   }
>
>
> Possibly this should be locale dependent behavior? Also, perhaps I'm
> missing the fact that the Shell is not supposed to support UTF-8 characters
> in start and end ranges, and users must escape their strings appropriately.
> (Which would be a bit of a pain).
>

I think the way the shell is written, it pushes binary data (that may not
be UTF-8) through strings.  I think at some point the \xNN escape codes are
converted to binary and this data is pushed back into a String.
ISO-8859-1 ensures this works.  Ideally the shell would not do this.


>
>
> - Drew
>
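A standalone sketch of the round-trip property Keith describes (not Accumulo code; the byte values are chosen arbitrarily for illustration): ISO-8859-1 maps every byte value 0x00-0xFF to a distinct char, so a byte[] -> String -> byte[] trip is lossless, while UTF-8 replaces invalid sequences with U+FFFD and corrupts the data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        // Arbitrary binary data, including bytes that are invalid as UTF-8.
        byte[] binary = {(byte) 0x00, (byte) 0x85, (byte) 0xFF, (byte) 0xC0};

        // ISO-8859-1 assigns every byte a unique char, so the round trip
        // through a String preserves the bytes exactly.
        String viaIso = new String(binary, StandardCharsets.ISO_8859_1);
        byte[] backFromIso = viaIso.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(binary, backFromIso)); // true

        // UTF-8 decoding replaces invalid sequences (0x85, 0xFF, 0xC0 here)
        // with U+FFFD, so re-encoding does not give back the original bytes.
        String viaUtf8 = new String(binary, StandardCharsets.UTF_8);
        byte[] backFromUtf8 = viaUtf8.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(binary, backFromUtf8)); // false
    }
}
```

This is why switching the shell's internal charset to UTF-8 is not a drop-in fix while binary data still travels through Strings.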