You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Bennie Leo <tb...@hotmail.com> on 2015/07/07 01:31:01 UTC

Limiting outer join

 Hi,
 
In the following query, it is possible to limit the amount of entries returned by an outer join to a single value? I want to obtain a single country from ipv4geotable for each entry in logontable. 

CREATE TABLE ipv4table AS
SELECT logon.IP, ipv4.Country
FROM 
(SELECT * FROM logontable WHERE isIpv4(IP)) logon
LEFT OUTER JOIN
(SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP) 
WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;
 
For instance, if I had the IP "W.X.Y.Z" in logontable, and that "W.X.Y.Z" fell in the range of both Italy and Spain in ipv4geotable, then I would like to associate it with Italy only. 
I've tried adding "LIMIT 1" to the second subquery :(SELECT StartIp, EndIp, Country FROM ipv4geotable LIMIT 1) ipv4 ON isIpv4(logon.IP)but this is wrong since the WHERE clause has to traverse all IPs. Limiting the where clause doesn't help either. 
Any ideas?
 
Thank you!
B
 

 		 	   		  

RE: Limiting outer join

Posted by Bennie Leo <tb...@hotmail.com>.
It went from about 60 mins to 3 mins. Hive was traversing the whole table multiple times, which is obviously inefficient!
 
> Date: Tue, 7 Jul 2015 15:55:19 -0700
> Subject: Re: Limiting outer join
> From: gopalv@apache.org
> To: user@hive.apache.org
> 
> 
> > Never mind, I got it working with UDF. I just pass the file location to
> >my evaluate function. Thanks! :)
> 
> Nice. Would be very interested in looking at performance of such a UDF, if
> you have numbers before/after.
> 
> I suspect it will be a magnitude or more faster than the BETWEEN/JOIN
> clauses.
> 
> Cheers,
> Gopal
> 
> 
 		 	   		  

Re: Limiting outer join

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Never mind, I got it working with UDF. I just pass the file location to
>my evaluate function. Thanks! :)

Nice. Would be very interested in looking at performance of such a UDF, if
you have numbers before/after.

I suspect it will be a magnitude or more faster than the BETWEEN/JOIN
clauses.

Cheers,
Gopal



RE: Limiting outer join

Posted by Bennie Leo <tb...@hotmail.com>.
Never mind, I got it working with UDF. I just pass the file location to my evaluate function. Thanks! :)
 
From: tbenleo@hotmail.com
To: user@hive.apache.org
Subject: RE: Limiting outer join
Date: Tue, 7 Jul 2015 09:59:22 -0700




Thanks for your replies.
 
I see how extracting the first country would work, however I was hoping to speed up my query by stopping the search once a country has been found.
 
Are you suggesting that I pass the whole IP table to a UDF and perform the search myself? I've only programmed simple UDFs so far (ex: reformat a string), so any additional details would be appreciated. I am mostly concerned about importing said table (currently stored in Hive) and distributing the task across nodes (note that I use Tez).
 
Regards,
B
 
> Date: Mon, 6 Jul 2015 18:18:44 -0700
> Subject: Re: Limiting outer join
> From: gopalv@apache.org
> To: user@hive.apache.org
> 
> 
> > In the following query, it is possible to limit the amount of entries
> >returned by an outer join to a single value? I want to obtain a single
> >country from ipv4geotable for each entry in logontable.
> 
> Yes, the PTF DENSE_RANK()/ROW_NUMBER() basically gives you that - you can
> read the first row out of each logon.IP except, there¹s no way to force
> which country wins over the other without an order by country in the
> OVER() clause as well.
> 
> That said, it will only get slower to produce 1 row per group, because of
> the distributed nature of the SQL engine, the reduction of data happens
> after a ordering shuffle.
> 
> You¹re doing range joins in a SQL engine without theta joins and MapReduce
> had no way to implement those at runtime (Tez has, with EdgeManager
> plugins).
> 
> The easiest/traditional approach out of doing geo-IP lookups is a compact
> UDF model without any joins at all.
> 
> There¹s some old threads on discussing this as a built-in & some code
> (with potential licensing issues) -
> http://markmail.org/message/w54j4upwg2wbh3xg
> 
> Cheers,
> Gopal
> 
> 
 		 	   		   		 	   		  

RE: Limiting outer join

Posted by Bennie Leo <tb...@hotmail.com>.
Thanks for your replies.
 
I see how extracting the first country would work, however I was hoping to speed up my query by stopping the search once a country has been found.
 
Are you suggesting that I pass the whole IP table to a UDF and perform the search myself? I've only programmed simple UDFs so far (ex: reformat a string), so any additional details would be appreciated. I am mostly concerned about importing said table (currently stored in Hive) and distributing the task across nodes (note that I use Tez).
 
Regards,
B
 
> Date: Mon, 6 Jul 2015 18:18:44 -0700
> Subject: Re: Limiting outer join
> From: gopalv@apache.org
> To: user@hive.apache.org
> 
> 
> > In the following query, it is possible to limit the amount of entries
> >returned by an outer join to a single value? I want to obtain a single
> >country from ipv4geotable for each entry in logontable.
> 
> Yes, the PTF DENSE_RANK()/ROW_NUMBER() basically gives you that - you can
> read the first row out of each logon.IP except, there¹s no way to force
> which country wins over the other without an order by country in the
> OVER() clause as well.
> 
> That said, it will only get slower to produce 1 row per group, because of
> the distributed nature of the SQL engine, the reduction of data happens
> after a ordering shuffle.
> 
> You¹re doing range joins in a SQL engine without theta joins and MapReduce
> had no way to implement those at runtime (Tez has, with EdgeManager
> plugins).
> 
> The easiest/traditional approach out of doing geo-IP lookups is a compact
> UDF model without any joins at all.
> 
> There¹s some old threads on discussing this as a built-in & some code
> (with potential licensing issues) -
> http://markmail.org/message/w54j4upwg2wbh3xg
> 
> Cheers,
> Gopal
> 
> 
 		 	   		  

Re: Limiting outer join

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> In the following query, it is possible to limit the amount of entries
>returned by an outer join to a single value? I want to obtain a single
>country from ipv4geotable for each entry in logontable.

Yes, the PTF DENSE_RANK()/ROW_NUMBER() basically gives you that - you can
read the first row out of each logon.IP except, there¹s no way to force
which country wins over the other without an order by country in the
OVER() clause as well.

That said, it will only get slower to produce 1 row per group, because of
the distributed nature of the SQL engine, the reduction of data happens
after a ordering shuffle.

You¹re doing range joins in a SQL engine without theta joins and MapReduce
had no way to implement those at runtime (Tez has, with EdgeManager
plugins).

The easiest/traditional approach out of doing geo-IP lookups is a compact
UDF model without any joins at all.

There¹s some old threads on discussing this as a built-in & some code
(with potential licensing issues) -
http://markmail.org/message/w54j4upwg2wbh3xg

Cheers,
Gopal