You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Raihan Jamal <ja...@gmail.com> on 2012/07/08 00:20:59 UTC

Joining two tables using HiveQL

Hi Everyone,

  CREATE EXTERNAL TABLE IF NOT EXISTS Table1   (This is the MAIN table
through which comparisons need to be made)
    (
    BUYER_ID BIGINT,
    ITEM_ID BIGINT,
    CREATED_TIME STRING
    )

 And this is the data in the above first table

*    **BUYER_ID**  |  **ITEM_ID**         |      **CREATED_TIME**   *
*
----------------------+---------------------------+-----------------------------------------
*
*     1015826235        220003038067           2012-06-21 07:57:39 *
*     1015826235        300003861266           2012-06-21 21:11:12 *
*     1015826235        140002997245           2012-06-14 20:10:16 *
*     1015826235        200002448035           2012-06-08 22:02:17 *
*     1015826235      *260003553381*         *2002-01-30 23:12:18* *

This is Second table in Hive- It also contains information about the items
we are purchasing.

    CREATE EXTERNAL TABLE IF NOT EXISTS Table2
    (
    USER_ID BIGINT,
    PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
    )

And this is the data in the above second table (Table2)-

    ***USER_ID**         |       **PURCHASED_ITEM***
*
-------------------------------+---------------------------------------------
*
*    1015826235
 [{"product_id":220003038067,"timestamps":"1340321132000"},
 {"product_id":300003861266,"timestamps":"1340271857000"},     *
*
{"product_id":140002997245,"timestamps":"1339694926000"},
 {"product_id":200002448035,"timestamps":"1339172659000"}]*
 **Compare Table2 with Table1 on USER_ID and BUYER_ID.**
 Find the `BUYER_ID(USER_ID)` and as well as those `ITEM_ID` and
`CREATED_TIME` which are missing from `Table2` after comparing from
`Table1` on `BUYER_ID`.

So If you look `Table2` data this(last) `ITEM_ID(PRODUCT_ID) 260003553381`
and `CREATED_TIME(TIMESTAMPS)  2002-01-30 23:12:18` is missing from Table2.
So I want to show the below result using the HiveQL query.
 * **BUYER_ID or USER_ID**      |             **ITEM_ID**       |
**CREATED_TIME**         |          **TIMESTAMP_OF_CREATED_TIME***
*
-----------------------------------------------------+------------------------------------+----------------------------------------------+----------------------------------------------------------------
*
*              1015826235                                260003553381
           2002-01-30 23:12:18                                1012457538*


Can anyone help me with this. As I am new to HiveQL so having lot of
Problem. I wrote the below HiveQL, but it is not working the way I wanted
to. Can anyone suggest me what wrong I am doing wrong?

**Updated:-**

I have written this query, but it doesn't working the way I wanted to.

    SELECT
      Table1.buyer_id, Table1.item_id, Table1.created_time,
      UNIX_TIMESTAMP(Table1.created_time)
    FROM ( SELECT user_id,prodID
           FROM Table2 test2
           LATERAL VIEW explode( test2.purchased_item.product_id )
prodTable AS prodID
         ) prodTable
    RIGHT OUTER JOIN Table1 ON prodTable.user_id = Table1.buyer_id;




*Raihan Jamal*