You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Mainak Ghosh (JIRA)" <ji...@apache.org> on 2019/05/08 23:35:00 UTC

[jira] [Created] (HIVE-21709) Count with expression does not work in Parquet

Mainak Ghosh created HIVE-21709:
-----------------------------------

             Summary: Count with expression does not work in Parquet
                 Key: HIVE-21709
                 URL: https://issues.apache.org/jira/browse/HIVE-21709
             Project: Hive
          Issue Type: Bug
    Affects Versions: 2.3.2
            Reporter: Mainak Ghosh


For parquet file with nested schema, count with expression as column name does not work when you are filtering on another column in the same struct. Here are the steps to reproduce:
{code:java}
CREATE TABLE `test_table`( `rtb_win` struct<`impression_id`:string, `pub_id`:string>) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

INSERT INTO TABLE test_table SELECT named_struct('impression_id', 'cat', 'pub_id', '2');

select count(rtb_win.impression_id) from test_table where rtb_win.pub_id ='2';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+------+ 
| _c0  |
+------+ 
| 0    | 
+------+

select count(*) from test_parquet_count_mghosh where rtb_win.pub_id ='2';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. 
+------+ 
| _c0  | 
+------+ 
| 1    | 
+------+{code}
As you can see the first query returns the wrong result while the second one returns the correct result.

The issue is an column order mismatch between the actual parquet file (impression_id first and pub_id second) and the Hive prunedCols datastructure (reverse). As a result in the filter we compare with the wrong value and the count returns 0. I have been able to identify the cause of this mismatch.

I would love to get the code reviewed and merged. Some of the code changes are changes to commits from Ferdinand Xu and Chao Sun.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)