Posted to issues@hive.apache.org by "Marta Kuczora (JIRA)" <ji...@apache.org> on 2019/03/07 13:28:00 UTC

[jira] [Updated] (HIVE-21407) Parquet predicate pushdown is not working correctly for char column types

     [ https://issues.apache.org/jira/browse/HIVE-21407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marta Kuczora updated HIVE-21407:
---------------------------------
    Description: 
If the 'hive.optimize.index.filter' parameter is false, the filter predicate is not pushed to Parquet, so the filtering happens only within Hive. If the parameter is true, the filter is pushed to Parquet, but for a char type the value that is pushed to Parquet is padded with spaces:
{noformat}
  @Override
  public void setValue(String val, int len) {
    super.setValue(HiveBaseChar.getPaddedValue(val, len), -1);
  }
{noformat} 
So if we have a char(10) column which contains the value "apple" and the where condition looks like 'where c='apple'', the value pushed to Parquet will be 'apple' followed by 5 spaces. But the stored values are not padded, so the pushed-down comparison matches nothing and no rows are returned from Parquet.
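
The mismatch can be shown with a minimal standalone sketch. It does not use any Hive or Parquet classes: `pad` is a hypothetical stand-in for what HiveBaseChar.getPaddedValue is described as doing above (right-padding with spaces to the declared length), and `stripTrailingSpaces` is a hypothetical helper included only to show that the two values differ by trailing spaces alone. It is not meant to represent the actual fix.

```java
public class CharPaddingDemo {

    // Hypothetical stand-in for the padding described in the issue:
    // right-pad val with spaces until it is len characters long.
    // (Assumption: values longer than len are left untouched here.)
    public static String pad(String val, int len) {
        StringBuilder sb = new StringBuilder(val);
        while (sb.length() < len) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // Hypothetical helper: remove trailing spaces only.
    public static String stripTrailingSpaces(String val) {
        int end = val.length();
        while (end > 0 && val.charAt(end - 1) == ' ') {
            end--;
        }
        return val.substring(0, end);
    }

    public static void main(String[] args) {
        String stored = "apple";          // value as stored in the file, unpadded
        String pushed = pad("apple", 10); // value pushed down for a char(10) column

        // The padded literal is 10 characters long and is not equal to the stored value.
        System.out.println("pushed length: " + pushed.length());
        System.out.println("equal as pushed: " + stored.equals(pushed));

        // Once the trailing spaces are removed, the values compare equal again.
        System.out.println("equal after stripping: "
                + stored.equals(stripTrailingSpaces(pushed)));
    }
}
```

An exact equality filter evaluated against the unpadded stored values therefore rejects every row, which is consistent with the empty result in the reproduction steps.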

How to reproduce:
{noformat}
$ create table ppd (c char(10), v varchar(10), i int) stored as parquet;
$ insert into ppd values ('apple', 'bee', 1),('apple', 'tree', 2),('hello', 'world', 1),('hello','vilag',3);
$ set hive.optimize.ppd.storage=true;
$ set hive.vectorized.execution.enabled=true;
$ set hive.vectorized.execution.enabled=false;
$ set hive.optimize.ppd=true;
$ set hive.optimize.index.filter=true;
$ set hive.parquet.timestamp.skip.conversion=false;
$ select * from ppd where c='apple';
+--------+--------+--------+
| ppd.c  | ppd.v  | ppd.i  |
+--------+--------+--------+
+--------+--------+--------+
$ set hive.optimize.index.filter=false; or set hive.optimize.ppd.storage=false;
$ select * from ppd where c='apple';
+-------------+--------+--------+
|    ppd.c    | ppd.v  | ppd.i  |
+-------------+--------+--------+
| apple       | bee    | 1      |
| apple       | tree   | 2      |
+-------------+--------+--------+
{noformat}

The issue surfaced after the fix for [HIVE-21327|https://issues.apache.org/jira/browse/HIVE-21327] was uploaded upstream. Before the HIVE-21327 fix, setting the parameter 'hive.parquet.timestamp.skip.conversion' to true in the parquet_ppd_char.q test hid this issue.


> Parquet predicate pushdown is not working correctly for char column types
> -------------------------------------------------------------------------
>
>                 Key: HIVE-21407
>                 URL: https://issues.apache.org/jira/browse/HIVE-21407
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Priority: Major


