Posted to dev@drill.apache.org by Khurram Faraaz <kf...@maprtech.com> on 2015/10/22 00:26:12 UTC

[DISCUSS] Processing non-printable characters in Drill

Hi All,

This discussion is related to DRILL-2322. It looks like Drill processes
non-printable characters in both cases, with and without the new text
reader (exec.storage.enable_new_text_reader).

Should we throw an error, since these are non-printable characters? For
more details, please take a look at JIRA DRILL-2322.

Content of the CSV file used in the test:
1,^A
2,^B
3,^C
4,^D
5,^E
6,^F

0: jdbc:drill:schema=dfs.tmp> select * from `nonPrintables.csv`;
+-----------------+
|     columns     |
+-----------------+
| ["1","\u0001"]  |
| ["2","\u0002"]  |
| ["3","\u0003"]  |
| ["4","\u0004"]  |
| ["5","\u0005"]  |
| ["6","\u0006"]  |
+-----------------+
6 rows selected (0.521 seconds)

0: jdbc:drill:schema=dfs.tmp> select columns[1] from `nonPrintables.csv`;
+---------+
| EXPR$0  |
+---------+
|        |
|        |
|        |
|        |
|        |
|        |
+---------+
6 rows selected (0.382 seconds)

Thanks,
Khurram

Re: [DISCUSS] Processing non-printable characters in Drill

Posted by Daniel Barclay <db...@maprtech.com>.
Khurram Faraaz wrote:
> ... It looks like Drill processes
> non-printable characters in both cases, with and without the new text
> reader (exec.storage.enable_new_text_reader)
>
> Should we throw an error since these are non-printable characters ?
No, I don't think so.  Does there seem to be any need to reject non-printable characters?

> ...
>
> Content from the csv file used in test
> 1,^A
> 2,^B
> 3,^C
> 4,^D
> 5,^E
> 6,^F
>
> 0: jdbc:drill:schema=dfs.tmp> select * from `nonPrintables.csv`;
> +-----------------+
> |     columns     |
> +-----------------+
> | ["1","\u0001"]  |
> | ["2","\u0002"]  |
> | ["3","\u0003"]  |
> | ["4","\u0004"]  |
> | ["5","\u0005"]  |
> | ["6","\u0006"]  |
> +-----------------+
> 6 rows selected (0.521 seconds)
>
> 0: jdbc:drill:schema=dfs.tmp> select columns[1] from `nonPrintables.csv`;
> +---------+
> | EXPR$0  |
> +---------+
> |        |
> |        |
> |        |
> |        |
> |        |
> |        |
> +---------+
> 6 rows selected (0.382 seconds)
Note what's going on there (re the difference between those two outputs):

In the first case, the strings with unprintable characters go through Drill's conversion of a value of a complex type (e.g., VARCHAR ARRAY) to a JSON string (in order to have a string to return through the JDBC API).  That conversion encodes string (VARCHAR) values as JSON string tokens, using JSON's escape sequences for the unprintable characters.  Finally, the resultant JSON string (the whole string of JSON, not the JSON string token) is displayed by SQLLine or the web UI or whatever.  (And don't forget the step of your copying and pasting into your message.)
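
As a rough illustration of that encoding step (a sketch only, not Drill's actual conversion code; the class and method names here are hypothetical), a JSON string token can be produced from a Java String by escaping the quote, the backslash, and every control character below U+0020:

```java
public class JsonEscapeSketch {

    /** Encodes a Java string as a JSON string token, escaping control characters. */
    public static String toJsonStringToken(String value) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the four-hex-digit escape form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // Mimics the first row of the "select *" output for the two-column CSV line.
        String row = "[" + toJsonStringToken("1") + "," + toJsonStringToken("\u0001") + "]";
        System.out.println(row); // prints ["1","\u0001"]
    }
}
```

The point is only that the escaping happens while building the JSON text, before any client sees the value, which is why the control characters show up in visible form in that output.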

In the second case, the core part of Drill is directly returning the character strings from the data through the JDBC API.  Then, SQLLine or the web UI or whatever is deciding how to display those strings--including how to handle any special, e.g., unprintable, characters.  Evidently, SQLLine doesn't render unprintable characters into some visible form.  It probably just writes them to your terminal's output stream.  Since your terminal doesn't render them specially either, the characters still aren't visible, and when you copied and pasted to compose your e-mail message, there was nothing from those special characters to copy.

(Actually, the non-printable characters are slightly visible--note how the six lines with visually blank values have terminating vertical-bar characters that don't line up with the other terminating "+" or "|" characters.)


From the point of view of the core part of Drill, it's up to the client of the JDBC API to decide how to display values, including character strings with unprintable characters.  (The JDBC API returns the Java representations (String objects) of the VARCHAR values.)


However, from the point of view of users, SQLLine (and Drill's web UI too) should render all values visibly, including character strings with unprintable characters.
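
For instance, a client could pass each String it gets from JDBC through a small display filter before printing it.  The following is a minimal sketch under that assumption; the class and method names are hypothetical, and this is not how SQLLine or the web UI is actually implemented:

```java
public class VisibleRenderSketch {

    /** Returns a display form of the value in which control characters are made visible. */
    public static String renderVisibly(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isISOControl(c)) {
                // Replace the invisible character with a six-character visible escape.
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A value like the second column of the third CSV line (Ctrl-C, U+0003).
        System.out.println("| " + renderVisibly("\u0003") + " |");
    }
}
```

Because the rendered form has a known visible width, a client that pads columns based on the rendered string rather than the raw one would also keep the table borders aligned.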

(They should also render byte strings competently, e.g., rendering in hex the bytes themselves rather than displaying in hex the hash code of the Java byte array object that contains (a specific copy of) the bytes of the byte string(!).)
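
To make that difference concrete, here is a small sketch (hypothetical class and method names, not code from Drill or SQLLine) contrasting the default string form of a Java byte array with a hex rendering of its contents:

```java
public class ByteRenderSketch {

    /** Renders the contents of a byte array as lowercase hex, two digits per byte. */
    public static String toHex(byte[] bytes) {
        StringBuilder out = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes do not sign-extend into longer hex strings.
            out.append(String.format("%02x", b & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x01, 0x02, (byte) 0xff};
        // Default Object.toString(): type tag plus identity hash code in hex, not the contents.
        System.out.println(data.toString());
        // Hex of the bytes themselves.
        System.out.println(toHex(data)); // prints 0102ff
    }
}
```

The first line printed varies from run to run and from copy to copy of the same bytes, which is why it is useless as a display of the value.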


Daniel

-- 
Daniel Barclay
MapR Technologies