You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Marta Kuczora via Review Board <no...@reviews.apache.org> on 2019/05/09 07:51:26 UTC

Re: Review Request 70474: HIVE-21407: Parquet predicate pushdown is not working correctly for char column types

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70474/
-----------------------------------------------------------

(Updated May 9, 2019, 7:51 a.m.)


Review request for hive and Peter Vary.


Changes
-------

Fixed the whitespace issue.


Bugs: HIVE-21407
    https://issues.apache.org/jira/browse/HIVE-21407


Repository: hive-git


Description
-------

The idea behind the patch is that for CHAR columns extend the predicate which is pushed to Parquet with an “or” clause which contains the same expression with a padded and a stripped value.
Example:
column c is a CHAR(10) type and the search expression is c='apple'
The predicate which is pushed to Parquet looked like c='apple ' before the patch and it would look like (c='apple ' or c='apple') after the patch.
Since the value 'apple' is stored in Parquet without padding, the predicate before the patch didn’t return any rows. With the patch it will return the correct row. 
Since on predicate level, there is no distinction between CHAR or VARCHAR, the predicates for VARCHARs will be changed as well, so the result set returned from Parquet will be wider than before.
Example:
A table contains a c VARCHAR(10) column and there is a row where c='apple' and there is an other row where c='apple '. If the search expression is c='apple ', both rows will be returned from Parquet after the patch. But since Hive is doing an additional filtering on the rows returned from Parquet, it won’t be a problem, the result set returned by Hive will contain only the row with the value 'apple '.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java be4c0d5 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRecordReaderWrapper.java 0210a0a 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/read/TestParquetFilterPredicate.java d464046 
  ql/src/test/queries/clientpositive/parquet_ppd_char.q 4230d8c 
  ql/src/test/queries/clientpositive/parquet_ppd_char2.q PRE-CREATION 
  ql/src/test/results/clientpositive/parquet_ppd_char2.q.out PRE-CREATION 


Diff: https://reviews.apache.org/r/70474/diff/2/

Changes: https://reviews.apache.org/r/70474/diff/1-2/


Testing
-------

Added new q test for testing the PPD for char and varchar types. Also extended the unit tests for the ParquetFilterPredicateConverter.toFilterPredicate method.

The TestParquetRecordReaderWrapper and the TestParquetFilterPredicate are both testing the same thing, the behavior of the ParquetFilterPredicateConverter.toFilterPredicate method. It doesn't make sense to have tests for the same use case in different test classes, so moved the test cases from the TestParquetRecordReaderWrapper to TestParquetFilterPredicate.


Thanks,

Marta Kuczora


Re: Review Request 70474: HIVE-21407: Parquet predicate pushdown is not working correctly for char column types

Posted by Peter Vary via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70474/#review215157
-----------------------------------------------------------


Ship it!




Ship It!

- Peter Vary


On máj. 9, 2019, 7:51 de, Marta Kuczora wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70474/
> -----------------------------------------------------------
> 
> (Updated máj. 9, 2019, 7:51 de)
> 
> 
> Review request for hive and Peter Vary.
> 
> 
> Bugs: HIVE-21407
>     https://issues.apache.org/jira/browse/HIVE-21407
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The idea behind the patch is that for CHAR columns extend the predicate which is pushed to Parquet with an “or” clause which contains the same expression with a padded and a stripped value.
> Example:
> column c is a CHAR(10) type and the search expression is c='apple'
> The predicate which is pushed to Parquet looked like c='apple ' before the patch and it would look like (c='apple ' or c='apple') after the patch.
> Since the value 'apple' is stored in Parquet without padding, the predicate before the patch didn’t return any rows. With the patch it will return the correct row. 
> Since on predicate level, there is no distinction between CHAR or VARCHAR, the predicates for VARCHARs will be changed as well, so the result set returned from Parquet will be wider than before.
> Example:
> A table contains a c VARCHAR(10) column and there is a row where c='apple' and there is an other row where c='apple '. If the search expression is c='apple ', both rows will be returned from Parquet after the patch. But since Hive is doing an additional filtering on the rows returned from Parquet, it won’t be a problem, the result set returned by Hive will contain only the row with the value 'apple '.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java be4c0d5 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRecordReaderWrapper.java 0210a0a 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/read/TestParquetFilterPredicate.java d464046 
>   ql/src/test/queries/clientpositive/parquet_ppd_char.q 4230d8c 
>   ql/src/test/queries/clientpositive/parquet_ppd_char2.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_ppd_char2.q.out PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/70474/diff/2/
> 
> 
> Testing
> -------
> 
> Added new q test for testing the PPD for char and varchar types. Also extended the unit tests for the ParquetFilterPredicateConverter.toFilterPredicate method.
> 
> The TestParquetRecordReaderWrapper and the TestParquetFilterPredicate are both testing the same thing, the behavior of the ParquetFilterPredicateConverter.toFilterPredicate method. It doesn't make sense to have tests for the same use case in different test classes, so moved the test cases from the TestParquetRecordReaderWrapper to TestParquetFilterPredicate.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>