You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Raihan Jamal <ja...@gmail.com> on 2012/07/06 01:26:39 UTC

HiveQL Joins- Hadoop

Can anyone help me with the below query?


*
*

*CREATE EXTERNAL TABLE IF NOT EXISTS Table1   (This is the MAIN table
through which comparisons need to be made)*

*    (*

*    BUYER_ID BIGINT,*

*    ITEM_ID BIGINT,*

*    CREATED_TIME STRING*

*    )*



And this is the data in the above first table

* *

*    **BUYER_ID**        **ITEM_ID**            **CREATED_TIME**   *

*     1015826235              220003038067                  2012-06-21
07:57:39 *

*     1015826235              300003861266                  2012-06-21
21:11:12 *

*     1015826235             140002997245                  2012-06-14
20:10:16 *

*     1015826235             200002448035                 2012-06-08
22:02:17 *

*     1015826235         *260003553381*           *2002-01-30 23:12:18* *



This is Second table in Hive- It also contains information about the items
we are purchasing.



    *CREATE EXTERNAL TABLE IF NOT EXISTS Table2*

*    (*

*    USER_ID BIGINT,*

*    PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>*

*    )*



And this is the data in the above second table-



    ***USER_ID**    **PURCHASED_ITEM***

*    1015826235
 [{"product_id":220003038067,"timestamps":"1340321132000"},
{"product_id":300003861266,"timestamps":"1340271857000"},     *

*
 {"product_id":140002997245,"timestamps":"1339694926000"},**
{"product_id":200002448035,"timestamps":"1339172659000"}]*





*Compare Table2 with Table1 on USER_ID and BUYER_ID.*





*Find the `BUYER_ID(USER_ID)` and as well as those `ITEM_ID` and
`CREATED_TIME` which are missing from `Table2` after comparing from
`Table1` on `BUYER_ID`.*



So If you look `Table2` data this `ITEM_ID(PRODUCT_ID) 260003553381` and
`CREATED_TIME(TIMESTAMPS)  2002-01-30 23:12:18` is missing from Table2. So
I want to show the below result using the HiveQL query.



***BUYER_ID or USER_ID**                   **ITEM_ID**
   **CREATED_TIME**                    **TIMESTAMP_OF_CREATED_TIME***

*      1015826235                                 260003553381
                  2002-01-30 23:12:18
   1012457538*


Below is the query I wrote, I am not sure whether it is right or not as
HiveQL is totally new to me. Can anyone help with this HiveQL query?

select BUYER_ID, ITEM_ID, CREATED_TIME, UNIX_TIMESTAMP(CREATED_TIME) from
table1 where table1.user_id = table2.buyer_id and (table1.item_id <>
table2.product_id or UNIX_TIMESTAMP(table1.created_time) <>
table2.timestamps)


*Raihan Jamal*