Posted to issues@drill.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2015/10/09 00:14:26 UTC
[jira] [Comment Edited] (DRILL-3712) Drill does not recognize UTF-16-LE encoding
[ https://issues.apache.org/jira/browse/DRILL-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949489#comment-14949489 ]
Deneche A. Hakim edited comment on DRILL-3712 at 10/8/15 10:13 PM:
-------------------------------------------------------------------
[~ebegoli] I did the following using the latest master:
- I used your script to create a text.psv file
- I created a gzipped version of the file (just .gz not tar.gz)
- I updated the "psv" definition in my dfs storage plugin like this:
{noformat}
"psv": {
  "type": "text",
  "extensions": [
    "tbl",
    "psv"
  ],
  "skipFirstLine": true,
  "delimiter": "|"
}
{noformat}
Here are the results I get when I query the file:
{noformat}
0: jdbc:drill:zk=local> select * from dfs.data.`test.psv.gz`;
+--------------------------------------------------------------------------------------------+
| columns |
+--------------------------------------------------------------------------------------------+
| ["value A0","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C0"] |
| ["value A1","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C1"] |
| ["value A2","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C2"] |
| ["value A3","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C3"] |
| ["value A4","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C4"] |
| ["value A5","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C5"] |
| ["value A6","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C6"] |
| ["value A7","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C7"] |
| ["value A8","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C8"] |
| ["value A9","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C9"] |
+--------------------------------------------------------------------------------------------+
10 rows selected (0.136 seconds)
{noformat}
{noformat}
0: jdbc:drill:zk=local> select columns[0], columns[1], columns[2] from dfs.data.`test.psv.gz`;
+-----------+---------------------+-----------+
| EXPR$0 | EXPR$1 | EXPR$2 |
+-----------+---------------------+-----------+
| value A0 | Encoded B | value C0 |
| value A1 | Encoded B | value C1 |
| value A2 | Encoded B | value C2 |
| value A3 | Encoded B | value C3 |
| value A4 | Encoded B | value C4 |
| value A5 | Encoded B | value C5 |
| value A6 | Encoded B | value C6 |
| value A7 | Encoded B | value C7 |
| value A8 | Encoded B | value C8 |
| value A9 | Encoded B | value C9 |
+-----------+---------------------+-----------+
10 rows selected (0.194 seconds)
{noformat}
Do you have more details about how to reproduce the issues you are seeing?
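For what it's worth, the interleaved {{\u0000}} characters in the first result set are exactly what UTF-16-LE bytes look like when read as a single-byte encoding. A small Python sketch of the effect (my illustration, not part of the original repro):

```python
# "Encoded B" in UTF-16-LE stores each ASCII character as two bytes,
# the second of which is NUL. A reader that assumes one byte per
# character therefore sees every other character as U+0000.
raw = "Encoded B".encode("utf-16-le")
as_single_byte = raw.decode("latin-1")  # byte-per-character view

# Each original character is followed by a '\x00', matching the
# columns[1] values shown in the query output above.
print(repr(as_single_byte))
```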
> Drill does not recognize UTF-16-LE encoding
> -------------------------------------------
>
> Key: DRILL-3712
> URL: https://issues.apache.org/jira/browse/DRILL-3712
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Affects Versions: 1.1.0
> Environment: OSX, likely Linux.
> Reporter: Edmon Begoli
> Fix For: Future
>
>
> We are unable to process files that OS X identifies as character set UTF-16-LE. After unzipping and converting to UTF-8, we are able to process them fine. There are CONVERT_TO and CONVERT_FROM functions that appear to address the issue, but we were unable to make them work on either a gzipped or an unzipped version of the UTF-16 file. We were able to use CONVERT_FROM OK, but when we tried to wrap its result in a cast to a date, or anything else, it failed. Working with the file natively exposed its double-byte nature (a substring 1,4 returns only the first two characters).
> I cannot post the data because it is proprietary in nature, but I am posting this code that might be useful in re-creating the issue:
> {noformat}
> #!/usr/bin/env python
> """ Generates a test psv file with some text fields encoded as UTF-16-LE. """
> def write_utf16le_encoded_psv():
>     total_lines = 10
>     encoded = "Encoded B".encode("utf-16-le")
>     with open("test.psv", "wb") as csv_file:
>         csv_file.write("header 1|header 2|header 3\n")
>         for i in xrange(total_lines):
>             csv_file.write("value A" + str(i) + "|" + encoded + "|value C" + str(i) + "\n")
>
> if __name__ == "__main__":
>     write_utf16le_encoded_psv()
> {noformat}
> then:
> tar zcvf test.psv.tar.gz test.psv
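The "substring 1,4 only returns the first two characters" symptom described above also follows directly from the two-bytes-per-character layout. A Python sketch of the effect (my illustration, not the reporter's code):

```python
# Taking the first four *bytes* of UTF-16-LE text yields only two
# logical characters, since each character occupies two bytes.
data = "Encoded B".encode("utf-16-le")
first_four_bytes = data[:4]

# Decodes to 'En': two characters, not four.
print(first_four_bytes.decode("utf-16-le"))
```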
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)