You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Bennie Can (JIRA)" <ji...@apache.org> on 2015/07/01 01:24:05 UTC

[jira] [Created] (HIVE-11154) Indexing not activated with left outer join and where clause

Bennie Can created HIVE-11154:
---------------------------------

             Summary: Indexing not activated with left outer join and where clause
                 Key: HIVE-11154
                 URL: https://issues.apache.org/jira/browse/HIVE-11154
             Project: Hive
          Issue Type: Bug
          Components: Hive, Indexing, Tez
    Affects Versions: 0.13.1, tez-branch
         Environment: HDInsight
            Reporter: Bennie Can
            Assignee: Bennie Can


I am attempting to optimize a query using indexing. However, indexing is not being activated, and my query takes an impractical amount of time to complete. Note that John Pullokkaran at Hortonworks has had a look at the "explain extended" log and suggested that I file a bug report.
 
Here is how I set up indexing:
Set hive.optimize.index.filter=true;
DROP INDEX IF EXISTS ipv4indexes ON ipv4geotable;
CREATE INDEX ipv4indexes 
ON TABLE ipv4geotable (StartIp, EndIp)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES ('hive.index.compact.binary.search'='true');
ALTER INDEX ipv4indexes ON ipv4geotable REBUILD;
 
And here is my query:
 
DROP TABLE IF EXISTS ipv4table;
CREATE TABLE ipv4table AS
SELECT logon.IP, ipv4.Country
FROM 
(SELECT * FROM logontable WHERE isIpv4(IP)) logon
LEFT OUTER JOIN
(SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP) 
WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;
 
What the query is doing is extracting an IP from logontable and finding in which range it lies within a geolocation table (which is sorted). When a range is found, a country matching the IP range is returned. 

I need to set up indexing in order to conduct a binary search rather than going through millions of rows one at a time. Note that I use the Tez engine, but the problem also occurs with map-reduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)