You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Peter Vary (Jira)" <ji...@apache.org> on 2021/03/08 13:51:00 UTC
[jira] [Commented] (HIVE-24833) Hive Uses Fetch Task For
HBaseStorageHandler Scans
[ https://issues.apache.org/jira/browse/HIVE-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297361#comment-17297361 ]
Peter Vary commented on HIVE-24833:
-----------------------------------
We will have the same issue with the Iceberg tables too. CC: [~mbod], [~lpinter]
> Hive Uses Fetch Task For HBaseStorageHandler Scans
> --------------------------------------------------
>
> Key: HIVE-24833
> URL: https://issues.apache.org/jira/browse/HIVE-24833
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Reporter: David Mollitor
> Priority: Major
>
> I believe that a Hive query with an HBase Storage Handler is incorrectly applies a predicate pushdown into the storage handler.
> I observed a FETCH optimization that took a long time to complete because it was performing a table scan across the entire HBase table.
> The only case in which a predicate should be pushed down the storage layer is for
> {code:sql}
> SELECT * FROM TABLE my_hbase_table WHERE row_key=?
> {code}
> This would be appropriate (EQ on the row key). Anything else will involve a scan of the table and there is no way to easily calculate how small a scan it will require and therefore should always be passed to the compute engine (Tez).
> {code:none}
> beeline> CREATE EXTERNAL TABLE t_hbase(key STRING,
> tinyint_col TINYINT,
> smallint_col SMALLINT,
> int_col INT,
> bigint_col BIGINT,
> float_col FLOAT,
> double_col DOUBLE,
> boolean_col BOOLEAN)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:binarykey#-,cf:binarybyte#-,cf:binaryshort#-,:key#-,cf:binarylong#-,cf:binaryfloat#-,cf:binarydouble#-,cf:binaryboolean#-")
> TBLPROPERTIES ("hbase.table.name" = "t_hive",
> "hbase.table.default.storage.type" = "binary",
> "external.table.purge" = "true")
>
>
>
> beeline> insert into table t_hbase values ('user1', 1, 11, 10, 1, 1.0, 1.0, true);
> beeline> explain select * from t_hbase where int_col=10;
> Explain
> Plan optimized by CBO.
> Stage-0
> Fetch Operator
> limit:-1
> Select Operator [SEL_2]
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
> TableScan [TS_0]
> Output:
> ["key","tinyint_col","smallint_col","bigint_col","float_col","double_col","boolean_col"]
> {code}
> You can see for this trivial example that this is an {{external}} table, Hive has no idea what's in it, but it is doing a Fetch + TableScan
--
This message was sent by Atlassian Jira
(v8.3.4#803005)