You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Raviv Murciano-Goroff <ra...@gmail.com> on 2015/08/29 01:49:40 UTC

Join vs. Where...In

Hi,

I often have the following situation: I have a small table with a list of
unique IDs and a very large table of events associated with the IDs. I want
to perform some aggregation including only events associated with IDs from
the small table.

Is there a rule of thumb for whether performing a JOIN on the unique ID is
faster or slower than using WHERE id IN (SELECT id FROM small_table...)?

Thank you for your advice,
Raviv

RE: Join vs. Where...In

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.

In RDBMS (say Oracle or Sybase) a hash join comes to play when you tend to join two tables with a large set of information (large being a relative term). The smaller of two tables is hashed into memory.  

 

For Hive, I have two tables one called  t with 1.7 million rows and another called smallt with 100 rows. Using three ways of joining the two tables on object_id column

 

1)    WHERE EXISTS

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from t WHERE EXISTS (select 1 from smallt where t.object_id = smallt.object_id);

+------+--+

| _c0  |

+------+--+

| 100  |

+------+--+

1 row selected (66.369 seconds)

 

2)    IN

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from t where t.object_id IN (select smallt.object_id from smallt);

+------+--+

| _c0  |

+------+--+

| 100  |

+------+--+

1 row selected (66.158 seconds)

 

3)    Classic Join

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from t, smallt where t.object_id = smallt.object_id;

+------+--+

| _c0  |

+------+--+

| 100  |

+------+--+

1 row selected (68.978 seconds)

 

 

You can see the results and judge for yourself

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 <http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

 <http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Raviv Murciano-Goroff [mailto:ravivmg@gmail.com] 
Sent: 29 August 2015 00:50
To: user@hive.apache.org
Subject: Join vs. Where...In

 

Hi,

 

I often have the following situation: I have a small table with a list of unique IDs and a very large table of events associated with the IDs. I want to perform some aggregation including only events associated with IDs from the small table.

 

Is there a rule of thumb for whether performing a JOIN on the unique ID is faster or slower than using WHERE id IN (SELECT id FROM small_table...)?

 

Thank you for your advice,

Raviv