Posted to issues@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2022/03/10 04:18:00 UTC

[jira] [Comment Edited] (HIVE-25845) Support ColumnIndexes for Parq files

    [ https://issues.apache.org/jira/browse/HIVE-25845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503985#comment-17503985 ] 

Rajesh Balamohan edited comment on HIVE-25845 at 3/10/22, 4:17 AM:
-------------------------------------------------------------------

Created a PR for this. It should also take care of HIVE-26013.
{noformat}
Tested with TPC-H at 100 GB scale (the customer table has 15,000,000 records) using the query below.

Note that with the patch, the query reads just 11 MB of data instead of 94 MB without it (see HDFS_BYTES_READ in the counters below), roughly an 8x reduction. This should be highly beneficial for cloud stores.

select count(*) from customer where c_custkey > 14099999 and c_custkey < 14850991;

Without Patch:
===========
INFO  : File System Counters:
..
INFO  :    FILE_BYTES_WRITTEN: 4703
INFO  :    HDFS_BYTES_READ: 94174097
...
..

With Patch:
===========
INFO  : File System Counters:
....
INFO  :    HDFS_BYTES_READ: 11777945
...{noformat}
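
As a quick sanity check on the figures above, the two HDFS_BYTES_READ counters can be compared directly. This is only an illustrative calculation on the quoted numbers, not part of the patch; the function name read_reduction is made up for the example.

```python
# Counter values copied from the job logs quoted above.
BYTES_READ_WITHOUT_PATCH = 94_174_097
BYTES_READ_WITH_PATCH = 11_777_945


def read_reduction(before, after):
    """Return (reduction factor, fraction of bytes saved) for two byte counts."""
    factor = before / after
    saved = 1 - after / before
    return factor, saved


factor, saved = read_reduction(BYTES_READ_WITHOUT_PATCH, BYTES_READ_WITH_PATCH)
print(f"{factor:.1f}x fewer bytes read, {saved:.1%} saved")
# prints "8.0x fewer bytes read, 87.5% saved"
```

The same comparison applies to any other counter pair taken from the two runs.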
 


was (Author: rajesh.balamohan):
Created a PR for this. It should also take care of HIVE-26013.
{noformat}
Tested with TPC-H at 100 GB scale (the customer table has 15,000,000 records) using the query below.

Note that with the patch, the query reads just 11 MB of data instead of 94 MB without it (see HDFS_BYTES_READ in the counters below), roughly an 8x reduction. This should be highly beneficial for cloud stores.

select count(*) from customer where c_custkey > 14099999 and c_custkey < 14850991 limit 1000;

Without Patch:
===========
INFO  : File System Counters:
..
INFO  :    FILE_BYTES_WRITTEN: 4703
INFO  :    HDFS_BYTES_READ: 94174097
...
..

With Patch:
===========
INFO  : File System Counters:
....
INFO  :    HDFS_BYTES_READ: 11777945
...{noformat}
 

> Support ColumnIndexes for Parq files
> ------------------------------------
>
>                 Key: HIVE-25845
>                 URL: https://issues.apache.org/jira/browse/HIVE-25845
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/PARQUET-1201
>  
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L271-L273]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)