Posted to issues@hive.apache.org by "John Sherman (Jira)" <ji...@apache.org> on 2022/09/27 23:45:00 UTC
[jira] [Commented] (HIVE-26320) Incorrect case evaluation for Parquet based table

    [ https://issues.apache.org/jira/browse/HIVE-26320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610300#comment-17610300 ] 

John Sherman commented on HIVE-26320:
-------------------------------------

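After digging into this, the root cause seems to be that the Parquet SerDe returns STRING/Text values for CHAR and VARCHAR columns.
The code in GenericUDFIn builds a Set<> from the constant part of the IN clause, and those constants carry the declared (correct) types. (The IN clause presumably comes from hive.optimize.point.lookup rewriting the OR of equality predicates, which would explain why disabling it hides the problem.)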

So the constant IN Set<> would have entries like:
struct<kob varchar(2),enhanced_type_code int> which is basically a List containing a HiveVarcharWritable and an IntWritable.

The Parquet reader seems to produce:
struct<kob String, enhanced_type_code int> which is basically a List containing a Text and an IntWritable.

So the constant IN set doesn't match the rows produced due to the types being different.
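
To illustrate the mismatch, here is a minimal, self-contained sketch. It does not use the real Hive classes: PlainString and VarcharString are hypothetical stand-ins for Text and HiveVarcharWritable, and the sketch assumes (as the observed behaviour suggests) that values of the two types never compare equal even when they hold the same characters.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InSetMismatchSketch {

    // Hypothetical stand-in for a plain string value (plays the role of Text).
    static final class PlainString {
        final String value;
        PlainString(String value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof PlainString && ((PlainString) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
    }

    // Hypothetical stand-in for a varchar value (plays the role of HiveVarcharWritable).
    static final class VarcharString {
        final String value;
        final int maxLength;
        VarcharString(String value, int maxLength) {
            this.value = value;
            this.maxLength = maxLength;
        }
        @Override public boolean equals(Object o) {
            // Assumption: only another varchar value can be equal to a varchar value.
            return o instanceof VarcharString && ((VarcharString) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
    }

    // The constant side of the IN clause, built with the declared column types.
    static Set<List<Object>> constantSet() {
        Set<List<Object>> set = new HashSet<>();
        set.add(Arrays.<Object>asList(new VarcharString("BB", 2), 18));
        set.add(Arrays.<Object>asList(new VarcharString("BC", 2), 18));
        return set;
    }

    // A row as the reader hands it over in this sketch: a plain string, not a varchar.
    static List<Object> rowAsRead(String kob, int code) {
        return Arrays.<Object>asList(new PlainString(kob), code);
    }

    // The same row after converting the string column to the declared varchar type.
    static List<Object> rowConverted(String kob, int code) {
        return Arrays.<Object>asList(new VarcharString(kob, 2), code);
    }

    public static void main(String[] args) {
        Set<List<Object>> inSet = constantSet();
        System.out.println(inSet.contains(rowAsRead("BB", 18)));    // prints false
        System.out.println(inSet.contains(rowConverted("BB", 18))); // prints true
    }
}
```

The lookup fails only because of the element type; the characters and the integer are identical in both rows.
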
I could likely fix this for the GenericUDFIn case - but I suspect there are other areas in which this type difference causes different results. So I've attempted to fix this at the SerDe deserialization level by converting the row data from PARQUET to the appropriate types (if needed).
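
As a rough sketch of what converting at the deserialization level could look like (this is not the actual patch): the Column and VarcharValue classes and the convertRow helper below are hypothetical names, a java.lang.String stands in for the Text value the reader returns, and truncating to the declared length is an assumption about how a varchar wrapper would behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowConversionSketch {

    // Hypothetical varchar wrapper; truncates to the declared length (assumption).
    static final class VarcharValue {
        final String value;
        final int maxLength;
        VarcharValue(String value, int maxLength) {
            this.value = value.length() > maxLength ? value.substring(0, maxLength) : value;
            this.maxLength = maxLength;
        }
        @Override public boolean equals(Object o) {
            return o instanceof VarcharValue && ((VarcharValue) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
        @Override public String toString() { return "varchar(" + maxLength + "):" + value; }
    }

    // Hypothetical column descriptor; a negative varcharLength means "not a varchar column".
    static final class Column {
        final String name;
        final int varcharLength;
        Column(String name, int varcharLength) {
            this.name = name;
            this.varcharLength = varcharLength;
        }
    }

    // Convert each raw value to the declared column type, but only where needed.
    static List<Object> convertRow(List<Column> schema, List<Object> raw) {
        List<Object> out = new ArrayList<>(raw.size());
        for (int i = 0; i < raw.size(); i++) {
            Object value = raw.get(i);
            int length = schema.get(i).varcharLength;
            if (length >= 0 && value instanceof String) {
                out.add(new VarcharValue((String) value, length));
            } else {
                out.add(value); // already the right type, pass through untouched
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Column> schema = Arrays.asList(
                new Column("kob", 2),
                new Column("enhanced_type_code", -1));
        List<Object> raw = Arrays.<Object>asList("BB", 18);
        System.out.println(convertRow(schema, raw));
    }
}
```

Doing the conversion once per row at read time means every downstream consumer sees the declared types, at the cost of one extra wrapper object per string column.
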

I am a bit concerned about whether this has performance implications, though I am hoping the conversion between the string types is relatively low overhead.

I've posted a WIP patch to see how it does on tests and I will likely investigate a couple of other areas.

> Incorrect case evaluation for Parquet based table
> -------------------------------------------------
>
>                 Key: HIVE-26320
>                 URL: https://issues.apache.org/jira/browse/HIVE-26320
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, Query Planning
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Chiran Ravani
>            Assignee: John Sherman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> A query involving a CASE statement with two or more conditions returns an incorrect result for tables stored as Parquet. The problem is not observed with ORC or TextFile.
> *Steps to reproduce*:
> {code:java}
> create external table case_test_parquet(kob varchar(2),enhanced_type_code int) stored as parquet;
> insert into case_test_parquet values('BB',18),('BC',18),('AB',18);
> select case when (
>                    (kob='BB' and enhanced_type_code='18')
>                    or (kob='BC' and enhanced_type_code='18')
>                  )
>             then 1
>             else 0
>         end as logic_check
> from case_test_parquet;
> {code}
> Result:
> {code}
> 0
> 0
> 0
> {code}
> Expected result:
> {code}
> 1
> 1
> 0
> {code}
> The problem does not appear when setting hive.optimize.point.lookup=false.


