Posted to issues@hive.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2022/10/18 16:08:00 UTC

[jira] [Commented] (HIVE-26639) ConstantVectorExpression and ExplainTask shouldn't rely on default charset

    [ https://issues.apache.org/jira/browse/HIVE-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619678#comment-17619678 ] 

László Bodor commented on HIVE-26639:
-------------------------------------

merged to master, thanks [~ayushtkn] for the review!

> ConstantVectorExpression and ExplainTask shouldn't rely on default charset
> --------------------------------------------------------------------------
>
>                 Key: HIVE-26639
>                 URL: https://issues.apache.org/jira/browse/HIVE-26639
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> HS2 (and other components) rely on UTF-8 encoding: when strings are stored as bytes, the UTF-8-encoded bytes are stored. Some Java APIs fall back to the default system encoding in various ways, which can produce incorrectly encoded bytes when the system default is something other than UTF-8. This patch intends to fix two such code paths:
> 1. ConstantVectorExpression
> in my case, this:
> {code}
> LOG.info("default charset name: " + java.nio.charset.Charset.defaultCharset().name());
> LOG.info("getBytes() = " + ((String) constantValue).getBytes());
> LOG.info("getBytes(StandardCharsets.UTF_8) = " + ((String) constantValue).getBytes(StandardCharsets.UTF_8));
> {code}
> led to:
> {code}
> default charset name: US-ASCII
> getBytes() = [B@73dcffb0
> getBytes(StandardCharsets.UTF_8) = [B@2ead0b9c
> {code}
> On the customer side, queries returned wrong results when the filter contained non-ASCII characters (which UTF-8 can represent but the default charset could not):
> {code}
> SELECT b FROM default.rlv_test1 where b='北京';
> ....
> ??
> {code}
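
A minimal sketch of the byte-level difference described above. This is not the code from the patch; the class and method names are hypothetical. The `[B@...` values in the log are only array identity hashes, so this sketch prints the array contents instead. It passes US-ASCII explicitly to imitate a JVM whose default charset is US-ASCII, and writes the literal with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {
    // Hypothetical helper: encode with an explicit charset, so the result
    // does not depend on the JVM's default charset.
    static byte[] encode(String value, Charset charset) {
        return value.getBytes(charset);
    }

    public static void main(String[] args) {
        String constant = "\u5317\u4EAC"; // the two characters from the query above

        // Imitates getBytes() on a JVM whose default charset is US-ASCII:
        // unmappable characters are replaced with '?' (byte 63).
        byte[] ascii = encode(constant, StandardCharsets.US_ASCII);
        // What components expecting UTF-8 bytes need to receive.
        byte[] utf8 = encode(constant, StandardCharsets.UTF_8);

        System.out.println("US-ASCII: " + Arrays.toString(ascii)); // [63, 63]
        System.out.println("UTF-8:    " + Arrays.toString(utf8));  // [-27, -116, -105, -28, -70, -84]
    }
}
```

Once the constant has been reduced to `??`, a comparison against the UTF-8 bytes stored in the table can no longer match, which is consistent with the wrong results reported above.
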
> 2. Explain
> Similarly, the EXPLAIN output was printed to a PrintStream that used a different encoding, leading to a plan like:
> {code}
> 	            Map Operator Tree:
> 	                TableScan
> 	                  alias: test_table
> 	                  filterExpr: (b = '??') (type: boolean)
> 	                  Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
> 	                  Filter Operator
> 	                    predicate: (b = '??') (type: boolean)
> 	                    Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
> 	                    Select Operator
> 	                      expressions: a (type: int), '??' (type: string), c (type: string)
> {code}
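
The same effect can be reproduced with a bare `PrintStream`. The sketch below is an illustration only, assuming nothing about how ExplainTask is actually implemented; `ExplainPrintDemo` and `render` are hypothetical names. It writes one predicate string through a `PrintStream` created with a named encoding and returns the raw bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class ExplainPrintDemo {
    // Hypothetical helper: print text through a PrintStream that uses the
    // named encoding and return the bytes that were written.
    static byte[] render(String text, String charsetName) throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(buffer, true, charsetName)) {
            out.print(text);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String predicate = "(b = '\u5317\u4EAC')";

        // Stands in for a PrintStream that picked up a US-ASCII default.
        String lossy = new String(render(predicate, "US-ASCII"), StandardCharsets.UTF_8);
        // Explicit UTF-8 keeps the characters intact.
        String intact = new String(render(predicate, "UTF-8"), StandardCharsets.UTF_8);

        System.out.println(lossy);                               // (b = '??')
        System.out.println("UTF-8 round trip preserved: " + intact.equals(predicate)); // true
    }
}
```

The three-argument `PrintStream(OutputStream, boolean, String)` constructor is used here because it is also available on Java 8.
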



--
This message was sent by Atlassian Jira
(v8.20.10#820010)