You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@impala.apache.org by "Yuanhao Luo (Code Review)" <ge...@cloudera.org> on 2016/06/06 14:18:07 UTC

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Yuanhao Luo has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3314

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit add support for multi-byte string as the field delimiter.
Mean while other separators(e.g. escape char, line delimiter and key-map
delimiter) are allowed to have only one byte.

TODO: Thinking that SSE4_2 doesn't support multi-byte matching, this
commit supports multi-byte field delimiter via direct string matching.
As a result, we would get poor performance if the multi-byte field
delimiter is relatively long. Maybe we can get better performance via
better string matching algorithm such as KMP.

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
14 files changed, 139 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser-test.cc
File be/src/exec/delimited-text-parser-test.cc:

> end-to-end tests live in the directory named 'tests'.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.

Jim Apple has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 6:

> I should go where to join in the discussion? Any hyperlinks?

Please start a discussion on

dev@impala.incubator.apache.org

The archives and address to email to start a subscription are

http://mail-archives.apache.org/mod_mbox/incubator-impala-dev/

-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has uploaded a new patch set (#2).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit add support for multi-byte string as the field delimiter.
Mean while other separators(e.g. escape char, line delimiter and key-map
delimiter) are only allowed to have one byte.

TODO: Thinking that SSE4_2 doesn't support multi-byte matching, this
commit supports multi-byte field delimiter via direct string matching.
As a result, we would get poor performance if the multi-byte field
delimiter is relatively long. Maybe we can get better performance via
better string matching algorithm such as KMP.

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
14 files changed, 202 insertions(+), 67 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/2
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/3314/4//COMMIT_MSG
Commit Message:

Line 15: As a result, we would get poor performance if the multi-byte field
> What happens today if tuple delimiters and field delimiters are the same ch
After we get the first column, the following filed delimiter would be parsed as tuple delimiter, so this record ends too early and rest columns would be set to NULL. Test log in  https://issues.cloudera.org/browse/IMPALA-2428 has confirmed that.


Line 17: better string matching algorithm such as KMP.
> Today, are terminators allowed to contain '\0'?
No, if there were '\0' in terminators, it would failed to persist metadata to postgres. Just see the log in https://issues.cloudera.org/browse/IMPALA-2428. Besides, I was wondering whether have you test the statement "create table tb1(id int) row format delimited escaped by '\0';" or "create table tb1(id int) row format delimited lines terminated by '\0';" before, both of them failed in my tests with the exception arise by  postgres:"ERROR: invalid byte sequence for encoding "UTF8": 0x00". Even if I use "--encoding=LATIN1" to init postgres db, the same error occurs. So in my commit, I add this restriction to terminators of text file.


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.cc
File be/src/exec/delimited-text-parser.cc:

Line 46:   if (field_delim.size() == 1) {
> I see you have taken that out of the parser now. Why is that?
It didn't take effects in my tests. I'm going to move all row format constrains checking to CreateTableStmt::analyzeRowFormat(), what do you think?


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java:

Line 268:     String fieldDelim = StringEscapeUtils.unescapeJava(rowFormat_.getFieldDelimiter());
> Why isn't it already used elsewhere?
At the beginning I want to replace parseDelim() with this function, but problems occurs.
With commons-lang3, result of StringEscapeUtils.unescapeJava("\\043\\100\\043") is "#@#",result of StringEscapeUtils.unescapeJava("\\") is "";
but in commons-lang, the results are "043100043",'\',respectively. What's more, with this function I can't convert decimal string into byte(s). For example, with Byte.parseByte("35") we get 35, which is a decimal representation of '#', but with StringEscapeUtils.unescapeJava("35"), we just get string "35". Originally, we can use escaped by '35' or '\u0023' or '\043' or '#' to make '#' as escape char, but with this function we can only use the later three but not first '35'. It seems that to support multi-byte delimiter, we can't use decimal representation. For example, to use '###' as field delimiter, we can use fields terminated by '\u0023\043#',but not decimal 35.


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
File fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java:

Line 171:     if (escapeChar == (byte)fieldDelim.charAt(0) ||
> I do not see that fix in this code.
I will move all constrains checking to CreateTableStmt:analyzeRowFormat(), so if any constrain breaks, we can throw an exception to warn user, but not logging here.


http://gerrit.cloudera.org:8080/#/c/3314/4/tests/query_test/test_delimited_text.py
File tests/query_test/test_delimited_text.py:

Line 56:       fields terminated by '\002'
> Please name this more appropriately - it only tests failure modes right now
what about "test_break_row_format_constrain()"?


-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has uploaded a new patch set (#3).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit add support for multi-byte string as the field delimiter.
Mean while other separators(e.g. escape char, line delimiter and key-map
delimiter) are only allowed to have one byte.

TODO: Thinking that SSE4_2 doesn't support multi-byte matching, this
commit supports multi-byte field delimiter via direct string matching.
As a result, we would get poor performance if the multi-byte field
delimiter is relatively long. Maybe we can get better performance via
better string matching algorithm such as KMP.

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
13 files changed, 191 insertions(+), 73 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/3
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has uploaded a new patch set (#7).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit add support for multi-byte string as the field delimiter.
Meanwhile other separators(e.g. escape char, line delimiter and key-map
delimiter) are only allowed to have one byte.

There are some constrains on terminators for text file:
1. Delimiters can't be an empty string
2. Tuple delimiter can't be the first byte of field delimiter
3. Escape character can't be the first byte of field delimiter
4. Escape character and tuple delimiter can't the be same value
5. Terminators can't contains '\0'

Warning: You can use only standard ASCII characters(with decimal value
from 0 to 127) in ascii or octal format to set filed terminator, but
not extended ASCII characters(with decimal value from 128 to 255) or
standard ASCII characters in unicode, decimal or hexadecimal format.
For example, to make standard ASCII characters "#@#" as field delimiter,
you can use fields terminated by '#\100\043', but not '\u0023', '35', '\x23'
respectively. I didn't find a solution to unescape decimal and hexadecimal
string. And there's a bug for SqlParser.parse() to parse unicode string and
octol value of extended ASCII characters. I have opened a issue in
https://issues.cloudera.org/browse/IMPALA-3777. After fixing this, we
can also use unicode string and octol string for extended ASCII characters.

Other one-byte terminators are still allow to use decimal value.

TODO: Thinking it's hard to use SSE4_2 for multi-byte matching, this
commit supports multi-byte field delimiter via direct string matching.
As a result, we would get poor performance if the multi-byte field
delimiter is relatively long. Maybe we can add a constrain on the length
of filed terminator.

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 460 insertions(+), 91 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/7
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 6:

I should go where to join in the discussion? Any hyperlinks?

-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.

Jim Apple has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 6:

> OK, I'm pausing my code review while we do a design review. I have
 > suggested the dev@apache mailing list for this design review.

In case it wasn't clear, I'm suggesting that you start that discussion and present your POV.

-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has uploaded a new patch set (#6).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit add support for multi-byte string as the field delimiter.
Mean while other separators(e.g. escape char, line delimiter and key-map
delimiter) are only allowed to have one byte.

There some constrains on terminators for text file:
1. Field delimiter can't be empty
2. Tuple delimiter can't be the first byte of field delimiter
3. Escape character can't be the first byte of field delimiter
4. Terminators can't contains '\0' in text file

Warning: You can use character or octal in filed terminator, but not
unicode, decimal and hexadecimal. For example, to make "###" as field
delimiter, you can use fields terminated by '\043#', but not '\u0023',
'35', '\x32' respectively. I didn't find a solution to unescape decimal
and hexadecimal string. And there's a bug for SqlParser.parse() to parse
unicode string. I have opened a issue in
https://issues.cloudera.org/browse/IMPALA-3777. After fixing this, we
can also use unicode string.

Other one-byte terminators are still allow to use decimal value.

TODO: Thinking that SSE4_2 doesn't support multi-byte matching, this
commit supports multi-byte field delimiter via direct string matching.
As a result, we would get poor performance if the multi-byte field
delimiter is relatively long. Maybe we can get better performance via
better string matching algorithm such as KMP.

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 390 insertions(+), 77 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/6
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.

Jim Apple has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 4:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3314/4//COMMIT_MSG
Commit Message:

Line 15: 2. tuple delimiter could not be the first byte of field delimiter
What happens today if tuple delimiters and field delimiters are the same character?

Same question about escape character.


Line 17: 4. terminators could not contains '\0'
Today, are terminators allowed to contain '\0'?


http://gerrit.cloudera.org:8080/#/c/3314/4/be/src/exec/delimited-text-parser-test.cc
File be/src/exec/delimited-text-parser-test.cc:

Line 173:   Validate(&single_null_no_escape_parser, string("a\0b|c\0d|e\0f", 11), 4, TUPLE_DELIM, 1, 3);
long line (here and elsewhere). Consider using clang-format to help you find and fix these, though please do no clang format code you are not changing.


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.cc
File be/src/exec/delimited-text-parser.cc:

Line 46:   DCHECK(tuple_delim != field_delim[0]);
> size of field_delim could not be 0. We have checked that in sql-parser.y an
I see you have taken that out of the parser now. Why is that?


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java:

Line 268:     String fieldDelim = StringEscapeUtils.unescapeJava(rowFormat_.getFieldDelimiter());
> Just as parseDelim(), it turns escaped string into un-escaped string.For ex
Why isn't it already used elsewhere?


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
File fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java:

Line 171:     if (escapeChar == (byte)fieldDelim.charAt(0) ||
> Done
I do not see that fix in this code.


http://gerrit.cloudera.org:8080/#/c/3314/4/tests/query_test/test_delimited_text.py
File tests/query_test/test_delimited_text.py:

Line 56:   def test_create_table_rowformat(self, vector, unique_database):
Please name this more appropriately - it only tests failure modes right now, so please name it as such.


Line 61:       4. terminator could not contain \0."""
Please add these notes on each test so we know what is being tested in each place.


-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has abandoned this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Abandoned

I will re-push this commit when Impala comes to 3.0

-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has uploaded a new patch set (#5).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit add support for multi-byte string as the field delimiter.
Mean while other separators(e.g. escape char, line delimiter and key-map
delimiter) are only allowed to have one byte.

There some constrains on terminators for text file:
1. Field delimiter can't be empty
2. Tuple delimiter can't be the first byte of field delimiter
3. Escape character can't be the first byte of field delimiter
4. Terminators can't contains '\0' in text file

Warning: You can use character or octal in filed terminator, but not
unicode, decimal and hexadecimal. For example, to make "###" as field
delimiter, you can use fields terminated by '\043#', but not '\u0023',
'35', '\x32' respectively. I didn't find a solution to unescape decimal
and hexadecimal string. And there's a bug for SqlParser.parse() to parse
unicode string. I have opened a issue in
https://issues.cloudera.org/browse/IMPALA-3777. After fixing this, we
can also use unicode string.

Other one-byte terminators are still allow to use decimal value.

TODO: Thinking that SSE4_2 doesn't support multi-byte matching, this
commit supports multi-byte field delimiter via direct string matching.
As a result, we would get poor performance if the multi-byte field
delimiter is relatively long. Maybe we can get better performance via
better string matching algorithm such as KMP.

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 390 insertions(+), 77 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/5
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.

Jim Apple has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 6:

Today, Impala's official source is at https://github.com/cloudera/Impala/. As of Monday, July 25, it will be at https://git-wip-us.apache.org/repos/asf?p=incubator-impala.git

When that happens, you will have to re-send your changes to be reviewed on gerrit, with some different git configuration. When the new configuration is ready, a notice will go to http://mail-archives.apache.org/mod_mbox/incubator-impala-dev/ . If you aren\u2019t subscribed to that list already, you should subscribe now.

-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 5:

Before reviewing this commit, please take a look at http://issues.cloudera.org/browse/IMPALA-2428

-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.

Jim Apple has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser-test.cc
File be/src/exec/delimited-text-parser-test.cc:

> Done
end-to-end tests live in the directory named 'tests'.


-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has uploaded a new patch set (#4).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit add support for multi-byte string as the field delimiter.
Mean while other separators(e.g. escape char, line delimiter and key-map
delimiter) are only allowed to have one byte.

There some constrains on terminators:
1. field delimiter could not be empty
2. tuple delimiter could not be the first byte of field delimiter
3. escape character could not be the first byte of field delimiter
4. terminators could not contains '\0'

TODO: Thinking that SSE4_2 doesn't support multi-byte matching, this
commit supports multi-byte field delimiter via direct string matching.
As a result, we would get poor performance if the multi-byte field
delimiter is relatively long. Maybe we can get better performance via
better string matching algorithm such as KMP.

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 405 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/4
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Yuanhao Luo (Code Review)" <ge...@cloudera.org>.

Yuanhao Luo has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 1:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/3314/1//COMMIT_MSG
Commit Message:

PS1, Line 11: allowed to have only one byte.
> "allowed to have only one byte" and "only allowed to have one byte" mean sl
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser-test.cc
File be/src/exec/delimited-text-parser-test.cc:

> Can you also add some end-to-end tests, please?
Done


PS1, Line 82: filed
> "field"
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.cc
File be/src/exec/delimited-text-parser.cc:

Line 46:   if (field_delim.size() == 1) {
> How should DelimitedTextParse behave if field_delim.size() == 0?
size of field_delim could not be 0. We have checked that in sql-parser.y and HdfsStorageDescriptor.java.


Line 145:           || **byte_buffer_ptr == collection_item_delim_) {
> This condition should probably go first, since it is cheaper.
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.h
File be/src/exec/delimited-text-parser.h:

Line 46:       char tuple_delim, const string& field_delim_ = string("\0", 1),
> Are there any restrictions about field_delim_? For instance, is the string 
I have tested field_delim_ as "\0\0", that's ok. For now, tuple_delim could not be the first byte of field_delim. Refer to https://gerrit.cloudera.org/#/c/3314/1/be/src/exec/delimited-text-parser.cc@139 , if the tuple_delim is as the same as first byte of field_delim, then if field_delim occurs, it would be parsed as a tuple delim and the left string would be parsed as a new tuple, but this is not we want.


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

Line 202:         || str_val->ptr[i] == escape_char_) {
> This condition should probably go first, since it is cheaper
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

Line 210:           || str_val->ptr[i] == escape_char_)) {
> This condition should probably go first, since it is easier to check.
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java:

Line 268:     String fieldDelim = StringEscapeUtils.unescapeJava(rowFormat_.getFieldDelimiter());
> Can you explain why it makes sense to use of unescapeJava here? I notice it
Just as parseDelim(), it turns escaped string into un-escaped string.For example, after calling String e = StringEscapeUtils.unescapeJava("\\043\\100\\043"); the value of e is "#@#";


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
File fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java:

PS1, Line 44: /h
> missing space between / and h
Done


Line 171:     if (escapeChar == (byte)fieldDelim.charAt(0) ||
> Should you check before here that fieldDelim is not null and has length > 0
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.

Jim Apple has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 6:

OK, I'm pausing my code review while we do a design review. I have suggested the dev@apache mailing list for this design review.

-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.

Jim Apple has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................


Patch Set 1:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/3314/1//COMMIT_MSG
Commit Message:

PS1, Line 11: allowed to have only one byte.
"allowed to have only one byte" and "only allowed to have one byte" mean slightly different things. I assume you mean the latter?


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser-test.cc
File be/src/exec/delimited-text-parser-test.cc:

Can you also add some end-to-end tests, please?


PS1, Line 82: filed
"field"


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.cc
File be/src/exec/delimited-text-parser.cc:

Line 46:   if (field_delim.size() == 1) {
How should DelimitedTextParse behave if field_delim.size() == 0?


Line 145:           || **byte_buffer_ptr == collection_item_delim_) {
This condition should probably go first, since it is cheaper.


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.h
File be/src/exec/delimited-text-parser.h:

Line 46:       char tuple_delim, const string& field_delim_ = string("\0", 1),
Are there any restrictions about field_delim_? For instance, is the string "\0\0" allowed? Can tuple_delim be the first character of field_delim_? What about the second?


Line 153:   /// SSE(xmm) register containing the tuple search character(s).
This and the other comments about SSE should be clarified to explain how they behave when field_delim_ is not a single character.


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

Line 202:         || str_val->ptr[i] == escape_char_) {
This condition should probably go first, since it is cheaper


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

Line 210:           || str_val->ptr[i] == escape_char_)) {
This condition should probably go first, since it is easier to check.


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java:

Line 268:     String fieldDelim = StringEscapeUtils.unescapeJava(rowFormat_.getFieldDelimiter());
Can you explain why it makes sense to use of unescapeJava here? I notice it isn't used elsewhere in the analyzer.


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
File fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java:

PS1, Line 44: /h
missing space between / and h


Line 171:     if (escapeChar == (byte)fieldDelim.charAt(0) ||
Should you check before here that fieldDelim is not null and has length > 0?


-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <lu...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-HasComments: Yes