Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2019/09/10 00:50:00 UTC

[jira] [Resolved] (IMPALA-3945) Don't allow creation of text tables with nonsensical delimiter and escape character combinations

     [ https://issues.apache.org/jira/browse/IMPALA-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-3945.
-----------------------------------
    Resolution: Won't Fix

I don't think this is likely to be worth the effort.

> Don't allow creation of text tables with nonsensical delimiter and escape character combinations
> ------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-3945
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3945
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.5.0
>         Environment: CentOS 6.7
>            Reporter: Yuanhao Luo
>            Priority: Minor
>              Labels: compatibility, newbie, usability
>
> There are several corner cases for delimiters, all of which are checked in CreateTableStmt.java:analyzeRowFormat(), for example:
> # AnalysisException: Field delimiter and line delimiter have same value
> # Warning: Field delimiter and escape character have same value
> # Warning: Line delimiter and escape character have same value
> I ran a simple test on the last two cases, and the results show that they do not work as expected.
> Detailed logs are below:
> * Normal case
> {noformat}
> [root@nobida147 workspace]# cat text-comma-backslash-newline.txt 
> one,two,3,4
> one\,one,two,3,4
> one\\,two,3,4
> one\\\,one,two,3,4
> one\\\\,two,3,4
> [nobida147:21000] > create table text_comma_backslash_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',' escaped by '\\' lines terminated by '\n';
> Query: create table text_comma_backslash_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',' escaped by '\\' lines terminated by '\n'
> Query submitted at: 2016-07-25 15:40:25 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=cc4f0a970ac242ac:a7bde0a84aa49c8c
> ++
> ||
> ++
> ++
> Fetched 0 row(s) in 0.14s
> [nobida147:21000] > load data inpath '/user/root/text-comma-backslash-newline.txt' into table text_comma_backslash_newline;
> Query: load data inpath '/user/root/text-comma-backslash-newline.txt' into table text_comma_backslash_newline
> Query submitted at: 2016-07-25 15:40:38 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=1f4c908335a41010:1006f8153e8068ab
> +----------------------------------------------------------+
> | summary                                                  |
> +----------------------------------------------------------+
> | Loaded 1 file(s). Total files in destination location: 1 |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 5.05s
> [nobida147:21000] > select * from text_comma_backslash_newline;
> Query: select * from text_comma_backslash_newline
> Query submitted at: 2016-07-25 15:40:49 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=9e473f7fe5822ca4:b663f16106e0f87
> +----------+------+------+------+
> | col1     | col2 | col3 | col4 |
> +----------+------+------+------+
> | one      | two  | 3    | 4    |
> | one,one  | two  | 3    | 4    |
> | one\     | two  | 3    | 4    |
> | one\,one | two  | 3    | 4    |
> | one\\    | two  | 3    | 4    |
> +----------+------+------+------+
> Fetched 5 row(s) in 0.44s
> {noformat}
> As the log above shows, the delimited text parser works as expected.
> * Corner case: Field delimiter and escape character have same value
> {noformat}
> [root@nobida147 workspace]# cat text-at-at-newline.txt 
> one@two@3@4
> one@,one@two@3@4
> one@\@two@3@4
> one@\@,one@two@3@4
> one@\@\@two@3@4
> [nobida147:21000] > create table text_at_at_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by '@' escaped by '@' lines terminated by '\n';
> Query: create table text_at_at_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by '@' escaped by '@' lines terminated by '\n'
> Query submitted at: 2016-07-25 16:59:23 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=9d4933d6e0c32dcd:92f6ade14fb545ba
> ++
> ||
> ++
> ++
> WARNINGS: Escape character is the first byte of field delimiter: byte @. Escape character will be ignored
> Fetched 0 row(s) in 0.12s
> [nobida147:21000] > load data inpath '/user/root/text-at-at-newline.txt' into table text_at_at_newline;
> Query: load data inpath '/user/root/text-at-at-newline.txt' into table text_at_at_newline
> Query submitted at: 2016-07-25 16:59:33 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=364542fc348b5e8c:a0a67958fcf68182
> +----------------------------------------------------------+
> | summary                                                  |
> +----------------------------------------------------------+
> | Loaded 1 file(s). Total files in destination location: 1 |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 5.84s
> [nobida147:21000] > select * from text_at_at_newline;
> Query: select * from text_at_at_newline
> Query submitted at: 2016-07-25 16:59:48 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=c6414569a57f2501:a25456e04ce6d998
> +------+------+------+------+
> | col1 | col2 | col3 | col4 |
> +------+------+------+------+
> | one  | two  | 3    | 4    |
> | one  | ,one | NULL | 3    |
> | one  | \    | NULL | 3    |
> | one  | \    | NULL | NULL |
> | one  | \    | NULL | NULL |
> +------+------+------+------+
> WARNINGS: Error converting column: 2 TO INT (Data is: two)
> file: hdfs://localhost:20500/test-warehouse/multi_byte_test2.db/text_at_at_newline/text-at-at-newline.txt
> record: one@,one@two@3@4
> Error converting column: 2 TO INT (Data is: two)
> file: hdfs://localhost:20500/test-warehouse/multi_byte_test2.db/text_at_at_newline/text-at-at-newline.txt
> record: one@\@two@3@4
> Error converting column: 2 TO INT (Data is: ,one)
> Error converting column: 3 TO INT (Data is: two)
> file: hdfs://localhost:20500/test-warehouse/multi_byte_test2.db/text_at_at_newline/text-at-at-newline.txt
> record: one@\@,one@two@3@4
> Error converting column: 2 TO INT (Data is: \)
> Error converting column: 3 TO INT (Data is: two)
> file: hdfs://localhost:20500/test-warehouse/multi_byte_test2.db/text_at_at_newline/text-at-at-newline.txt
> record: one@\@\@two@3@4
> Fetched 5 row(s) in 0.44s
> {noformat}
> As the log above shows, even though a warning is issued for this case, the delimited text parser does not work as expected.
> Take the second result row, "| one  | ,one | NULL | 3    |", whose original line is "one@,one@two@3@4". With '@' as both the escape character and the field delimiter, we expect the first '@' to be treated as an escape character, so the value of the first column would be 'one,one'. However, the text parser treats the first '@' as a field delimiter, so the value of the first column is 'one' and the value of the second column becomes ',one'.
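The difference between the normal case and this corner case can be reproduced with a small escape-aware splitter. This is only a sketch in Python, not Impala's parser: the name split_fields and its parameters are hypothetical, and the rule that an escape character equal to the delimiter is ignored is an assumption taken from the "Escape character will be ignored" warning in the log above.

```python
from typing import List, Optional


def split_fields(line: str, delim: str, escape: Optional[str]) -> List[str]:
    """Split one record into fields, honoring a single-character escape.

    Assumption: when the escape character equals the delimiter it is
    ignored, so every delimiter character splits the record.
    """
    if escape == delim:
        escape = None  # ambiguous combination: fall back to no escaping

    fields: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(line):
        ch = line[i]
        if escape is not None and ch == escape and i + 1 < len(line):
            # Drop the escape character and keep the next character literally.
            current.append(line[i + 1])
            i += 2
            continue
        if ch == delim:
            # Unescaped delimiter: close the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields
```

With ',' and a backslash, split_fields(r"one\,one,two,3,4", ",", "\\") yields 'one,one' as the first field, matching the normal case. With '@' for both roles, split_fields("one@,one@two@3@4", "@", "@") yields 'one', ',one', 'two', '3', '4', so a four-column table sees 'two' in an INT column, which is consistent with the NULL in the second result row.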
> * Corner case: Line delimiter and escape character have same value
> {noformat}
> [root@nobida147 workspace]# cat text-comma-backslash-backslash.txt 
> one,two,3,4\one\,one,two,3,4\one\\,two,3,4\one\\\,one,two,3,4\one\\\\,two,3,4
> [nobida147:21000] > create table text_comma_backslash_backslash(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',' escaped by '\\' lines terminated by '\\';
> Query: create table text_comma_backslash_backslash(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',' escaped by '\\' lines terminated by '\\'
> Query submitted at: 2016-07-25 18:39:08 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=f9482b9325a29355:751546a6b3640ebf
> ++
> ||
> ++
> ++
> WARNINGS: Line delimiter and escape character have same value: byte 92. Escape character will be ignored
> Fetched 0 row(s) in 0.12s
> [nobida147:21000] > load data inpath '/user/root/text-comma-backslash-backslash.txt' into table text_comma_backslash_backslash;
> Query: load data inpath '/user/root/text-comma-backslash-backslash.txt' into table text_comma_backslash_backslash
> Query submitted at: 2016-07-25 18:39:42 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=a442f76958434721:4ddd2cdc78e3bbbd
> +----------------------------------------------------------+
> | summary                                                  |
> +----------------------------------------------------------+
> | Loaded 1 file(s). Total files in destination location: 1 |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 4.09s
> [nobida147:21000] > select * from text_comma_backslash_backslash;
> Query: select * from text_comma_backslash_backslash
> Query submitted at: 2016-07-25 18:39:58 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=6e41d94e5e4458f3:d2482d108394d6a0
> +------+------+------+------+
> | col1 | col2 | col3 | col4 |
> +------+------+------+------+
> | one  | two  | 3    | 4    |
> | one  | NULL | NULL | NULL |
> |      | one  | NULL | 3    |
> |      | NULL | NULL | NULL |
> |      | two  | 3    | NULL |
> |      | NULL | NULL | NULL |
> |      | NULL | NULL | NULL |
> |      | one  | NULL | 3    |
> |      | NULL | NULL | NULL |
> |      | NULL | NULL | NULL |
> |      | NULL | NULL | NULL |
> |      | two  | 3    | 4    |
> +------+------+------+------+
> WARNINGS: Error converting column: 2 TO INT (Data is: two)
> file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_comma_backslash_backslash/text-comma-backslash-backslash.txt
> record: ,one,two,3,4one
> Error converting column: 3 TO INT (Data is: 4one)
> file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_comma_backslash_backslash/text-comma-backslash-backslash.txt
> record: ,two,3,4one
> Error converting column: 2 TO INT (Data is: two)
> file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_comma_backslash_backslash/text-comma-backslash-backslash.txt
> record: ,one,two,3,4one
> Fetched 12 row(s) in 0.44s
> {noformat}
> Again, as the log above shows, even though a warning is issued for this case, the delimited text parser does not work as expected.
> Take the second result row, "| one  | NULL | NULL | NULL |", whose original value is "one\,one,two,3,4". We expect the first backslash to be treated as an escape character; however, the text parser takes it as the tuple delimiter, so the value of col1 is 'one' and col2, col3, and col4 become NULL.
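For reference, the three checks listed in the description could be sketched roughly as below. This is an illustration in Python rather than Impala's Java code in analyzeRowFormat(): AnalysisError, check_row_format, and the strict flag are hypothetical names, and strict=True models the rejection the issue title asks for, not behavior Impala is known to have.

```python
from typing import List


class AnalysisError(Exception):
    """Hypothetical stand-in for Impala's AnalysisException."""


def check_row_format(field_delim: str, line_delim: str, escape: str,
                     strict: bool = False) -> List[str]:
    """Return warnings for ambiguous delimiter/escape combinations.

    Raises AnalysisError when the field and line delimiters collide.
    With strict=True (an assumption, not existing behavior), the two
    warning cases are rejected as well.
    """
    if field_delim == line_delim:
        raise AnalysisError("Field delimiter and line delimiter have same value")

    warnings: List[str] = []
    if escape == field_delim:
        warnings.append("Field delimiter and escape character have same value")
    if escape == line_delim:
        warnings.append("Line delimiter and escape character have same value")

    if strict and warnings:
        # Turn the ambiguous combinations into hard errors.
        raise AnalysisError("; ".join(warnings))
    return warnings
```

Under these assumptions, check_row_format("@", "\n", "@") returns a single warning, while the same call with strict=True raises AnalysisError instead of letting the table be created.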



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
