You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/08/13 23:26:00 UTC

[jira] [Commented] (IMPALA-1652) Fix CHAR datatype: Incorrect results with basic predicate on CHAR typed column.

    [ https://issues.apache.org/jira/browse/IMPALA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177374#comment-17177374 ] 

Tim Armstrong commented on IMPALA-1652:
---------------------------------------

I looked at this again and I think the initial analysis about whether comparisons were whitespace sensitive may have made this seem harder to fix than it actually is.

There are two general approaches I can see to doing comparisons that ignore trailing whitespace from CHAR columns
# Custom comparison operators that are trailing-whitespace-insensitive. 
# Stripping trailing whitespace from CHAR columns, then doing a trailing-whitespace-sensitive comparison.

Mostly the two produce the same results, with the exception if there is a non-CHAR column in the comparison with trailing spaces. In that case the first approach would ignore that trailing space and the second would preserve it.

I think we'd been assuming that we'd have to do the first, but I think the second is actually consistent with Hive's behaviour.

Postgres does do something similar to the first, at least for VARCHAR <-> CHAR comparisons, but Hive does the second:
{noformat}
postgres=# select cast('foo' as char(50)) = cast('foo ' as varchar);
 ?column? 
----------
 t
(1 row)
0: jdbc:hive2://localhost:11050> select cast('foo' as char(2)) = cast('foo ' as varchar(50));
+--------+
|  _c0   |
+--------+
| false  |
+--------+
{noformat}

Moreover, both systems strip whitespace when casting from CHAR to other types, e.g.
{noformat}
postgres=# select cast(cast('foo' as char(50)) as text) || '|';
 ?column? 
----------
 foo|
0: jdbc:hive2://localhost:11050> select cast(cast('foo' as char(50)) as string) || '|';
+-------+
|  _c0  |
+-------+
| foo|  |
+-------+
{noformat}

Impala is the clear outlier here:
{noformat}
[localhost.EXAMPLE.COM:21000] default> select cast(cast('foo' as char(50)) as string) || '|';
+------------------------------------------------------+
| concat(cast(cast('foo' as char(50)) as string), '|') |
+------------------------------------------------------+
| foo                                               |  |
+------------------------------------------------------+
{noformat}

Postgres seems to strip trailing whitespace when passing arguments to functions, but Hive does. Postgres weirdly behaves differently when using the || operator.
{noformat}
postgres=# select concat(cast('foo' as char(10)), '|');
   concat    
-------------
 foo       |
(1 row)

0: jdbc:hive2://localhost:11050> select concat(cast('foo' as char(10)), '|');
+-------+
|  _c0  |
+-------+
| foo|  |
+-------+
{noformat}

Where I'm going with this is that it seems like we'd get behaviour that was the same as Hive if casts from CHAR to STRING/VARCHAR stripped trailing whitespace. The downside of doing that is that we'd change some behaviour that is currently consistent with postgres and is fairly reasonable.

So I think roughly the pros and cons are:

Option 1:
* More consistent with postgres (but not fully consistent - would likely not implement all quirks)
* Slightly fewer changes from previous Impala behaviour 
Option 2:
* Consistent with Hive
* Simpler to implement
* More internally consistent

So we should consider fixing this by making casts from char to string strip trailing whitespace. It would need a bit more investigation to see if that fully matches Hive behaviour, but I think it seems viable.

> Fix CHAR datatype: Incorrect results with basic predicate on CHAR typed column.
> -------------------------------------------------------------------------------
>
>                 Key: IMPALA-1652
>                 URL: https://issues.apache.org/jira/browse/IMPALA-1652
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.1, Impala 2.3.0
>            Reporter: Alexander Behm
>            Priority: Major
>              Labels: correctness, downgraded, usability
>
> Repro:
> {code}
> create table foo(col1 char(10));
> insert into foo values (cast('test1' as char(10)));
> select * from foo where col1 = 'test1'; <-- returns an empty result set
> select * from foo where col1 = cast('test1' as char(10)); <-- correctly returns 1 row
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org