Posted to issues@hive.apache.org by "David Mollitor (Jira)" <ji...@apache.org> on 2020/06/10 15:14:00 UTC
[jira] [Commented] (HIVE-21709) Count with expression does not work in Parquet

    [ https://issues.apache.org/jira/browse/HIVE-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130776#comment-17130776 ] 

David Mollitor commented on HIVE-21709:
---------------------------------------

Still interested in working on this?

Can you please create a PR against master?

> Count with expression does not work in Parquet
> ----------------------------------------------
>
>                 Key: HIVE-21709
>                 URL: https://issues.apache.org/jira/browse/HIVE-21709
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.2
>            Reporter: Mainak Ghosh
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> For a Parquet file with a nested schema, count with an expression as the column name does not work when you filter on another column in the same struct. Here are the steps to reproduce:
> {code:java}
> CREATE TABLE `test_table`( `rtb_win` struct<`impression_id`:string, `pub_id`:string>) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> INSERT INTO TABLE test_table SELECT named_struct('impression_id', 'cat', 'pub_id', '2');
> select count(rtb_win.impression_id) from test_table where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> +------+ 
> | _c0  |
> +------+ 
> | 0    | 
> +------+
> select count(*) from test_table where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. 
> +------+ 
> | _c0  | 
> +------+ 
> | 1    | 
> +------+{code}
> As you can see, the first query returns the wrong result while the second one returns the correct result.
> The issue is a column order mismatch between the actual Parquet file (impression_id first and pub_id second) and the Hive prunedCols data structure (reverse order). As a result, the filter compares against the wrong value and the count returns 0. I have been able to identify the cause of this mismatch.
> I would love to get the code reviewed and merged. Some of the code changes are changes to commits from Ferdinand Xu and Chao Sun.
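
The failure mode described in the issue can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Hive's reader code: `file_columns`, `pruned_columns`, and `count_by_position` are hypothetical names, and the sketch assumes that values are looked up by position in a column list whose order may differ from the order in the file.

```python
# Hypothetical illustration; none of these names come from the Hive code base.

# Field order inside the struct as stored in the Parquet file.
file_columns = ["impression_id", "pub_id"]
# Reversed order, as the issue describes for the pruned column list.
pruned_columns = ["pub_id", "impression_id"]

# One row, with values laid out in file order.
rows = [("cat", "2")]


def count_by_position(rows, columns, filter_col, filter_val, count_col):
    """Count rows whose filter_col equals filter_val and whose count_col is
    not null, resolving both columns by their index in `columns`."""
    f = columns.index(filter_col)
    c = columns.index(count_col)
    return sum(1 for r in rows if r[f] == filter_val and r[c] is not None)


# With the reversed list, "pub_id" resolves to position 0, so the filter
# compares 'cat' against '2' and no row matches.
wrong = count_by_position(rows, pruned_columns, "pub_id", "2", "impression_id")

# With the file order, "pub_id" resolves to position 1 and the row matches.
right = count_by_position(rows, file_columns, "pub_id", "2", "impression_id")

print(wrong, right)
```

Under these assumptions the reversed list gives a count of 0 and the file-ordered list gives 1, matching the two results reported above.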



--
This message was sent by Atlassian Jira
(v8.3.4#803005)