Posted to issues@hive.apache.org by "Marta Kuczora (JIRA)" <ji...@apache.org> on 2019/06/26 13:28:00 UTC

[jira] [Commented] (HIVE-21407) Parquet predicate pushdown is not working correctly for char column types

    [ https://issues.apache.org/jira/browse/HIVE-21407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873330#comment-16873330 ] 

Marta Kuczora commented on HIVE-21407:
--------------------------------------

Hi [~kgyrtkirk],
the padded value in the OR expression is needed because when the predicate is created, all we know is that it is a STRING type. That could be a fixed-size char, a varchar or a string Hive type, and for varchar and string types a value with trailing spaces is valid. So if we didn't add the padded value to the OR expression and searched for a value with trailing spaces on a varchar column, that value wouldn't be found.
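To illustrate the idea, here is a minimal, self-contained sketch (not the actual Hive predicate-building code; the class and method names are made up for illustration): since at predicate-creation time we only see a STRING literal, the safe equality filter has to test both the raw form and the form padded to the declared char length:

```java
// Illustrative sketch: why the pushed-down equality filter needs
// "col = raw OR col = padded" when the column may be char, varchar or string.
public class CharPredicateSketch {

    // Right-pad val with spaces up to len, like a char(len) value would be.
    static String pad(String val, int len) {
        StringBuilder sb = new StringBuilder(val);
        while (sb.length() < len) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // The effective pushed filter: stored = literal OR stored = pad(literal).
    static boolean matches(String storedValue, String literal, int len) {
        return storedValue.equals(literal)
            || storedValue.equals(pad(literal, len));
    }

    public static void main(String[] args) {
        // Stored value carries trailing spaces: matched via the padded branch.
        System.out.println(matches("apple     ", "apple", 10)); // true
        // Stored value is unpadded: matched via the raw branch.
        System.out.println(matches("apple", "apple", 10));      // true
        // Different value: neither branch matches.
        System.out.println(matches("hello", "apple", 10));      // false
    }
}
```

With only one of the two branches, one of the first two cases above would be missed, which is exactly the varchar-with-trailing-spaces problem described.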

About your other question:
"if the values are stored without this padding - at which part this padding happens; inside Hive or in Parquet?"
It happens within Hive, not in Parquet. But changing this could lead to interoperability problems, since all previous versions of Hive have written char values to Parquet this way.
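The mismatch itself can be shown with a tiny standalone demo (a sketch that mimics what HiveBaseChar.getPaddedValue does, labeled as an assumption rather than the Hive code itself): the predicate literal gets padded to the declared char(10) length, while the value stored in Parquet is unpadded, so a plain equality comparison never matches:

```java
// Sketch of the char padding mismatch between the pushed-down literal
// and the unpadded value stored in Parquet. getPaddedValue here only
// mimics the Hive helper of the same name.
public class PaddingMismatchDemo {

    static String getPaddedValue(String val, int len) {
        StringBuilder sb = new StringBuilder(val);
        while (sb.length() < len) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String stored = "apple";                     // value as stored in Parquet
        String pushed = getPaddedValue("apple", 10); // literal pushed to the reader
        System.out.println("[" + pushed + "]");      // prints [apple     ]
        System.out.println(stored.equals(pushed));   // false: the row is filtered out
    }
}
```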

> Parquet predicate pushdown is not working correctly for char column types
> -------------------------------------------------------------------------
>
>                 Key: HIVE-21407
>                 URL: https://issues.apache.org/jira/browse/HIVE-21407
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>         Attachments: HIVE-21407.2.patch, HIVE-21407.3.patch, HIVE-21407.patch
>
>
> If the 'hive.optimize.index.filter' parameter is false, the filter predicate is not pushed to parquet, so the filtering only happens within Hive. If the parameter is true, the filter is pushed to parquet, but for a char type, the value which is pushed to Parquet will be padded with spaces:
> {noformat}
>   @Override
>   public void setValue(String val, int len) {
>     super.setValue(HiveBaseChar.getPaddedValue(val, len), -1);
>   }
> {noformat} 
> So if we have a char(10) column containing the value "apple" and the where condition is 'where c='apple'', the value pushed to Parquet will be 'apple' followed by 5 spaces. But the stored values are not padded, so no rows will be returned from Parquet.
> How to reproduce:
> {noformat}
> $ create table ppd (c char(10), v varchar(10), i int) stored as parquet;
> $ insert into ppd values ('apple', 'bee', 1),('apple', 'tree', 2),('hello', 'world', 1),('hello','vilag',3);
> $ set hive.optimize.ppd.storage=true;
> $ set hive.vectorized.execution.enabled=true;
> $ set hive.vectorized.execution.enabled=false;
> $ set hive.optimize.ppd=true;
> $ set hive.optimize.index.filter=true;
> $ set hive.parquet.timestamp.skip.conversion=false;
> $ select * from ppd where c='apple';
> +--------+--------+--------+
> | ppd.c  | ppd.v  | ppd.i  |
> +--------+--------+--------+
> +--------+--------+--------+
> $ set hive.optimize.index.filter=false; or set hive.optimize.ppd.storage=false;
> $ select * from ppd where c='apple';
> +-------------+--------+--------+
> |    ppd.c    | ppd.v  | ppd.i  |
> +-------------+--------+--------+
> | apple       | bee    | 1      |
> | apple       | tree   | 2      |
> +-------------+--------+--------+
> {noformat}
> The issue surfaced after the fix for [HIVE-21327|https://issues.apache.org/jira/browse/HIVE-21327] was uploaded upstream. Before the HIVE-21327 fix, setting the parameter 'hive.parquet.timestamp.skip.conversion' to true in the parquet_ppd_char.q test hid this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)