Posted to dev@hive.apache.org by "Craig Condit (JIRA)" <ji...@apache.org> on 2014/05/27 19:36:03 UTC

[jira] [Commented] (HIVE-1643) support range scans and non-key columns in HBase filter pushdown

    [ https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009938#comment-14009938 ] 

Craig Condit commented on HIVE-1643:
------------------------------------

The patch as-is has a few issues...

First, at least in Hive 0.12, the patch misbehaves when multiple tables are joined. I've seen cases where Hive was clearly attempting to push down predicates for the wrong table. This leads to NullPointerExceptions when the column lookup fails, because the HBase storage handler assumes that any predicate it receives refers to a valid column. I suspect this is a bug in the query optimizer, but I have not been able to determine exactly where.
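
As a rough illustration of the kind of guard that would avoid the NullPointerException, the sketch below checks each predicate term against the table's column mapping before accepting it. It is not code from the patch: the `Term` type, the `split_by_known_columns` helper, and the example mapping are all hypothetical names chosen for this example.

```python
from typing import Dict, List, NamedTuple, Tuple


class Term(NamedTuple):
    """One simple comparison in a predicate (hypothetical representation)."""
    column: str   # Hive column name the comparison refers to
    op: str       # comparison operator, e.g. ">=" or "="
    value: object


def split_by_known_columns(
    terms: List[Term], column_mapping: Dict[str, str]
) -> Tuple[List[Term], List[Term]]:
    """Separate terms on mapped columns from terms on unknown columns.

    Terms on unknown columns are returned instead of being looked up
    blindly, so the caller can leave them for Hive to evaluate.
    """
    known: List[Term] = []
    unknown: List[Term] = []
    for term in terms:
        if term.column in column_mapping:
            known.append(term)
        else:
            unknown.append(term)
    return known, unknown


# Illustrative mapping from Hive column name to HBase column.
mapping = {"rowkey": ":key", "name": "cf:name"}
terms = [Term("rowkey", ">=", "A"), Term("other_table_col", "=", 1)]
known, unknown = split_by_known_columns(terms, mapping)
```

With this input, only the rowkey term ends up in `known`; the term on the column that belongs to another table is handed back rather than dereferenced.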

Second, when a complex query predicate is passed down, the fallback behavior is to punt on the entire expression, even if it could be partially evaluated (for example, rowkey >= 'A' AND rowkey < 'B' AND ([complex bit])). This leads to unexpected full table scans in HBase, and it can be triggered by a single term that uses an operator the storage handler has no case for. At the very least, the code should handle the rowkey parts whenever possible.
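
A minimal sketch of the partial handling described here, assuming the predicate has already been flattened into a list of AND-ed terms: rowkey comparisons with a supported operator are pushed down, and everything else is left as a residual for Hive to evaluate. The tuple layout, the operator set, and the function name are assumptions made for this example, not the storage handler's actual API.

```python
from typing import List, Tuple

# (column, operator, value); a hypothetical flattened form of one AND-ed term.
Term = Tuple[str, str, object]

# Operators assumed to translate into an HBase scan range on the row key.
SUPPORTED_ROWKEY_OPS = {"=", ">=", ">", "<", "<="}


def partition_conjunction(
    terms: List[Term], rowkey: str = "rowkey"
) -> Tuple[List[Term], List[Term]]:
    """Split AND-ed terms into (pushed, residual).

    Because the terms are joined by AND, pushing down only a subset can
    never drop a matching row; the residual is still applied afterwards.
    """
    pushed: List[Term] = []
    residual: List[Term] = []
    for term in terms:
        column, op, _value = term
        if column == rowkey and op in SUPPORTED_ROWKEY_OPS:
            pushed.append(term)
        else:
            residual.append(term)
    return pushed, residual


terms = [
    ("rowkey", ">=", "A"),
    ("rowkey", "<", "B"),
    ("payload", "regexp", "^x.*"),  # operator the handler has no case for
]
pushed, residual = partition_conjunction(terms)
```

Here the two rowkey bounds are pushed down and the unsupported term stays in the residual, so one unhandled operator no longer forces a full table scan.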

Third, even when predicate pushdown works, it often causes secondary issues when interacting with HBase. When no rowkey expression exists, evaluating the filters can drive very high CPU usage on HBase, and can even cause HBase RPC timeouts if so many rows are filtered out that no data is returned quickly enough. It would be nice to be able to control (somehow) which expressions the code tries to push down.
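
One possible shape for such a control, sketched under the assumption that rowkey terms and non-key filter terms have already been separated: non-key filters are pushed down only when a rowkey bound limits the scan, unless a switch says otherwise. The function and the `allow_filters_without_rowkey` switch are hypothetical; no such Hive setting is implied.

```python
from typing import List, Tuple

# (column, operator, value); hypothetical term representation.
Term = Tuple[str, str, object]


def select_pushdown(
    rowkey_terms: List[Term],
    filter_terms: List[Term],
    allow_filters_without_rowkey: bool = False,
) -> List[Term]:
    """Choose which terms to send to HBase.

    Rowkey terms are always pushed. Non-key filters are pushed only if
    a rowkey term bounds the scan, or if the caller opts in explicitly.
    """
    if rowkey_terms or allow_filters_without_rowkey:
        return list(rowkey_terms) + list(filter_terms)
    # No rowkey bound: push nothing and let Hive evaluate the filters.
    return []


bounded = select_pushdown([("rowkey", ">=", "A")], [("name", "=", "x")])
unbounded = select_pushdown([], [("name", "=", "x")])
forced = select_pushdown([], [("name", "=", "x")], allow_filters_without_rowkey=True)
```

The default keeps expensive server-side filtering off unbounded scans, while the opt-in preserves the current behavior for anyone who wants it.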

At our location, we didn't even try to port the patch to Hive 0.13 when we upgraded, mainly due to issues #2 and #3. Fortunately, CTEs have allowed us to ensure that only rowkey predicates get pushed down like so:

{noformat}
with a as (select ... from hbase_table where rowkey >= 'start' and rowkey < 'end') select * from a where ...;
{noformat}

It might be more useful for Hive-HBase integration to focus on ensuring that rowkey predicates are always pushed down (except for things like OR/NOT expressions, etc.) rather than trying to push down other types of expressions.



> support range scans and non-key columns in HBase filter pushdown
> ----------------------------------------------------------------
>
>                 Key: HIVE-1643
>                 URL: https://issues.apache.org/jira/browse/HIVE-1643
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>            Reporter: John Sichi
>            Assignee: bharath v
>              Labels: patch
>         Attachments: HIVE-1643.patch, Hive-1643.2.patch, hbase_handler.patch
>
>
> HIVE-1226 added support for WHERE rowkey=3.  We would like to support WHERE rowkey BETWEEN 10 and 20, as well as predicates on non-rowkeys (plus conjunctions etc).  Non-rowkey conditions can't be used to filter out entire ranges, but they can be used to push the per-row filter processing as far down as possible.


