Posted to dev@drill.apache.org by Charles Givre <cg...@gmail.com> on 2018/08/07 14:02:34 UTC

Interesting Errors

Hello Drill Developers,
I wanted to share an interesting development that happened yesterday. I was teaching a class at BlackHat, and we have a worksheet that includes a Drill demonstration using PyDrill. Basically, the students are asked to execute a query in Drill using PyDrill and then visualize the results.

Anyway, a student from Japan tried this and was getting all kinds of crazy errors, so I sat down and worked with him to debug. It turns out that the period on the Japanese keyboard maps to a different Unicode character than on US keyboards, and hence the queries throw errors. I discovered this because when I cut and pasted a query from a text file that I wrote, the query executed, but if we typed one in, it broke. After digging around a bit, I found that it was the period character.
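
For anyone debugging something similar, here is a minimal Python sketch of one way to spot the offending character. The function name find_non_ascii and the sample query are made up for illustration, and the sample uses U+3002 only as an example; the character the student actually typed may have been a different one.

```python
import unicodedata


def find_non_ascii(query):
    """Return (index, character, code point, Unicode name) for each non-ASCII character."""
    hits = []
    for index, char in enumerate(query):
        if ord(char) > 127:
            # unicodedata.name() takes a default for characters that have no name.
            name = unicodedata.name(char, "UNKNOWN")
            hits.append((index, char, "U+%04X" % ord(char), name))
    return hits


# Hypothetical query: the separator after "dfs" is U+3002, not an ASCII period.
typed_query = "SELECT * FROM dfs\u3002test"
for hit in find_non_ascii(typed_query):
    print(hit)
```

Running a check like this on the query string before it is sent would point at the exact position of the character, which is easier than comparing a typed query with a pasted one by eye.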

I’m not sure that this can or should be fixed, but I wanted to let people know about this. 

Best,
— C

Re: Interesting Errors

Posted by Vova Vysotskyi <vv...@gmail.com>.
Hi Charles,

Thanks for the interesting story!

I think specifying the correct charset may help to solve this problem. For
more details, please refer to DRILL-5772
<https://issues.apache.org/jira/browse/DRILL-5772>.

Kind regards,
Volodymyr Vysotskyi



Re: Interesting Errors

Posted by Ted Dunning <te...@gmail.com>.
The core problem is that Japanese has four sets of characters that are
commonly used for writing text, but only one set of punctuation. The four
sets are kanji, hiragana, katakana and roman characters. Kanji are derived
from Chinese Han ideographs; hiragana derives from a phonetic courtly
language and is used mostly for morphology and particles, while katakana
is used mostly to transliterate foreign words. There is also a tradition
of "romaji", where words (foreign or Japanese) are written using Latin
characters for different kinds of emphasis. To make things more fun,
Japanese character sets typically include multiple copies of roman
characters, some set very wide to match the spacing of kanji characters.

The result can be absolute mayhem when it comes to ASCII-centric
traditions. Computer languages are difficult. Domain names are disastrous
(because visually indistinguishable variants can be used for hacking).

When programming, Japanese native speakers have to learn a lot of nuances
like which full stop character to use.  Not fun or easy.
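
To illustrate the "multiple copies of roman characters" point, here is a small Python sketch that uses Unicode NFKC normalization from the standard library to fold the wide (fullwidth) copies back to ASCII. The helper name fold_width is made up for this example, and this is only one possible approach, not something Drill or PyDrill is known to do.

```python
import unicodedata


def fold_width(text):
    """Apply NFKC normalization, which maps fullwidth forms to their ASCII equivalents."""
    return unicodedata.normalize("NFKC", text)


# Fullwidth "SELECT" followed by a fullwidth full stop (U+FF0E).
wide = "\uff33\uff25\uff2c\uff25\uff23\uff34\uff0e"
print(fold_width(wide))

# These two have no ASCII compatibility mapping, so NFKC leaves them alone.
for char in ("\u3002", "\u30fb"):
    print("U+%04X" % ord(char), unicodedata.name(char), fold_width(char) == char)
```

Note that NFKC only helps with the fullwidth copies. The ideographic full stop (U+3002) and the katakana middle dot (U+30FB) are separate punctuation marks and pass through unchanged. NFKC also rewrites other compatibility characters, so applying it blindly to text inside string constants would change data.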




Re: Interesting Errors

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
As far as I can tell, the SQL grammar is based on the ASCII subset of Unicode, meaning that SQL allows only the "full stop" (.) character, but not the Unicode variations.

Since the SQL standard defines the meaning of the full stop, Drill probably does not want to allow variations.

It seems various vendors have come up with different markers for Unicode characters in constants or names. (I believe Drill uses UTF-8 as its native character set, so none of these hacks are necessary.)

Perhaps either Python does allow variations, or the editor was doing the translation?

Although the various Drill tools could try to play this trick, the problem would be knowing when to do the substitution (outside of constants and names) and when to leave the characters alone (when quoted).
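
As a rough sketch of what that substitution might look like on the client side, the Python below replaces a few lookalike full stops with "." only when they appear outside quotes. Everything here is an assumption for illustration: the function name, the set of lookalike characters, and the simplified quoting rules (single quotes, double quotes, and backticks, with no handling of backslash escapes or SQL comments). It is not how any Drill tool behaves today.

```python
# Assumed lookalikes: ideographic, fullwidth, and halfwidth ideographic full stops.
LOOKALIKE_FULL_STOPS = {"\u3002", "\uff0e", "\uff61"}


def replace_full_stops_outside_quotes(query):
    """Replace lookalike full stops with '.' unless they sit inside a quoted span."""
    result = []
    closing_quote = None  # the quote character we are currently inside, if any
    for char in query:
        if closing_quote is not None:
            # Inside a quoted span: copy characters verbatim until the quote closes.
            if char == closing_quote:
                closing_quote = None
        elif char in ("'", '"', "`"):
            closing_quote = char
        elif char in LOOKALIKE_FULL_STOPS:
            char = "."
        result.append(char)
    return "".join(result)


# Hypothetical query with the lookalike in a constant, as a separator, and in a quoted name.
query = "SELECT 'a\u3002b' FROM dfs\u3002`my\u3002file`"
print(replace_full_stops_outside_quotes(query))
```

Even this small version shows the difficulty: the tool has to track quoting well enough to tell a separator from data, and a real implementation would need to follow the parser's actual lexical rules.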

I wonder, how do users of such keyboards handle the case of editing, say, C code, which has an old-school grammar based on ASCII for its tokens? There must be some standard technique.

Thanks,
- Paul
 


Re: Interesting Errors

Posted by Charles Givre <cg...@gmail.com>.
That was the interesting part. The Python code that was using that character didn’t seem to fail. It was the quoted query being sent to Drill that was failing.


Re: Interesting Errors

Posted by Pradeeban Kathiravelu <kk...@gmail.com>.
If I understand correctly, he was using the character ・

Different languages have different symbols.

The same thing can be said about the Chinese equivalent 。

The characters ・ and 。 are entirely different from the ".".
Not sure whether this needs to be fixed. If I am not completely
misunderstanding something, he will also fail if he attempts to use that
character (in place of ".") in programming languages (Java, C, ...).

Regards,
Pradeeban.





-- 
Pradeeban Kathiravelu.
Senior Systems Software Engineer, Emory University, Atlanta, GA, USA.
Ph.D. Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
Portugal.
Université catholique de Louvain, Louvain-la-Neuve, Belgium.

Blog: [Llovizna] kkpradeeban.blogspot.com
LinkedIn: www.linkedin.com/in/kpradeeban