You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/08/13 23:22:19 UTC

[Hadoop Wiki] Update of "Hive/LanguageManual/Joins" by AdamKramer

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/Joins" page has been changed by AdamKramer.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins?action=diff&rev1=20&rev2=21

--------------------------------------------------

  <<TableOfContents>>
  
+ == Join Syntax ==
- ## page was renamed from Hive/LanguageManual/LanguageManual/Joins
- == THIS PAGE WAS MOVED TO HIVE XDOCS ! DO NOT EDIT!Join Syntax ==
  Hive supports the following syntax for joining tables:
  
  {{{
@@ -113, +112 @@

    JOIN b ON (a.key = b.key)
    LEFT OUTER JOIN c ON (a.key = c.key)
  }}}
-  . ..first joins a on b, throwing away everything in a or b that does not have a corresponding key in the other table. The reduced table is then joined on c. This provides unintuitive results if there is a key that exists in both a and c, but not b: The whole row (including a.val1, a.val2, and a.key) is dropped in the "a JOIN b" step, so when the result of that is joined with c, any row with a c.key that had a corresponding a.key or b.key (but not both) will show up as NULL, NULL, NULL, c.val.
+  ...first joins a on b, throwing away everything in a or b that does not have a corresponding key in the other table. The reduced table is then joined on c. This provides unintuitive results if there is a key that exists in both a and c but not b: The whole row (including a.val1, a.val2, and a.key) is dropped in the "a JOIN b" step because it is not in b. The result does not have a.key in it, so when it is LEFT OUTER JOINed with c, c.val does not make it in because there is no c.key that matches an a.key (because that row from a was removed). Similarly, if this were a RIGHT OUTER JOIN (instead of LEFT), we would end up with an even weirder effect: NULL, NULL, NULL, c.val, because even though we specified a.key=c.key as the join key, we dropped all rows of a that did not match the first JOIN.
+ 
+ To achieve the more intuitive effect, we should instead do FROM c LEFT OUTER JOIN a ON (c.key = a.key) LEFT OUTER JOIN b ON (c.key = b.key).
  
   * LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient way. Since Hive currently does not support IN/EXISTS subqueries, you can rewrite your queries using LEFT SEMI JOIN. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.