You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Himanish Kushary <hi...@gmail.com> on 2012/08/18 04:50:54 UTC

Multiple Join giving different results compared to separate joins using intermediate table

Hi,

I am facing a weird issue.Probably I am missing something.Could somebody
please guide me.

I have three tables in hive.

Table A - clientid,shopid,shopposition,shopdate,shopitemid,shopitemdescr
(partitioned by date)
Table B- shopid,shopposition (distinct shopid,shopposition from another
table)
Table C - clientid,clientname,clientcity

I would like to join the tables above to find the following information for
a date
- shopid,shopposition,shopdate,shopitemid,shopitemdescr,clientname,clientcity

I wrote something like - select <<above fields>> from A a join B b on
(a.shopid=b.shopid and a.shopposition=b.shopposition) join C
c(c.clientid=a.clientid)

The result is not what I expected and has lot of duplicates.

Whereas if I create a seperate table from the first join - create table D
as select <<relevant fields>> from A a join B b on (a.shopid=b.shopid and
a.shopposition=b.shopposition)

and then join with C - select <<fields>> from D d join C c on (
d.clientid=c.clientid) the results are as expected.

For example if Table A has 8 rows then following the second approach gives
me 8 rows with proper fields but the the former approach I get lot of rows
( probably a cartesian product)

I am confused about whats going wrong with the first approach,could
somebody throw some light please.

---------------------------
Thanks & Regards
Himanish

Re: Multiple Join giving different results compared to separate joins using intermediate table

Posted by Himanish Kushary <hi...@gmail.com>.

Thanks Vinod.Why would the results vary ? Could you please give me some
pointers on why hive will treat them as different ?

On Sat, Aug 18, 2012 at 4:29 AM, Vinod Singh <vi...@vinodsingh.com> wrote:

> You may have to rewrite the query as-
>
> select <<above fields>> from
> (select <<above fields>> from A a join B b on (a.shopid=b.shopid and
> a.shopposition=b.shopposition)) D join C c(c.clientid=a.clientid)
>
> That will give results as per your expectation.
>
> Thanks,
> Vinod
>
>
> On Sat, Aug 18, 2012 at 8:20 AM, Himanish Kushary <hi...@gmail.com>wrote:
>
>> Hi,
>>
>> I am facing a weird issue.Probably I am missing something.Could somebody
>> please guide me.
>>
>> I have three tables in hive.
>>
>> Table A - clientid,shopid,shopposition,shopdate,shopitemid,shopitemdescr
>> (partitioned by date)
>> Table B- shopid,shopposition (distinct shopid,shopposition from another
>> table)
>> Table C - clientid,clientname,clientcity
>>
>> I would like to join the tables above to find the following information
>> for a date
>> - shopid,shopposition,shopdate,shopitemid,shopitemdescr,clientname,clientcity
>>
>> I wrote something like - select <<above fields>> from A a join B b on
>> (a.shopid=b.shopid and a.shopposition=b.shopposition) join C
>> c(c.clientid=a.clientid)
>>
>> The result is not what I expected and has lot of duplicates.
>>
>> Whereas if I create a seperate table from the first join - create table D
>> as select <<relevant fields>> from A a join B b on (a.shopid=b.shopid and
>> a.shopposition=b.shopposition)
>>
>> and then join with C - select <<fields>> from D d join C c on (
>> d.clientid=c.clientid) the results are as expected.
>>
>> For example if Table A has 8 rows then following the second approach
>> gives me 8 rows with proper fields but the the former approach I get lot of
>> rows ( probably a cartesian product)
>>
>> I am confused about whats going wrong with the first approach,could
>> somebody throw some light please.
>>
>> ---------------------------
>> Thanks & Regards
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish

Re: Multiple Join giving different results compared to separate joins using intermediate table

Posted by Vinod Singh <vi...@vinodsingh.com>.

You may have to rewrite the query as-

select <<above fields>> from
(select <<above fields>> from A a join B b on (a.shopid=b.shopid and
a.shopposition=b.shopposition)) D join C c(c.clientid=a.clientid)

That will give results as per your expectation.

Thanks,
Vinod

On Sat, Aug 18, 2012 at 8:20 AM, Himanish Kushary <hi...@gmail.com>wrote:

> Hi,
>
> I am facing a weird issue.Probably I am missing something.Could somebody
> please guide me.
>
> I have three tables in hive.
>
> Table A - clientid,shopid,shopposition,shopdate,shopitemid,shopitemdescr
> (partitioned by date)
> Table B- shopid,shopposition (distinct shopid,shopposition from another
> table)
> Table C - clientid,clientname,clientcity
>
> I would like to join the tables above to find the following information
> for a date
> - shopid,shopposition,shopdate,shopitemid,shopitemdescr,clientname,clientcity
>
> I wrote something like - select <<above fields>> from A a join B b on
> (a.shopid=b.shopid and a.shopposition=b.shopposition) join C
> c(c.clientid=a.clientid)
>
> The result is not what I expected and has lot of duplicates.
>
> Whereas if I create a seperate table from the first join - create table D
> as select <<relevant fields>> from A a join B b on (a.shopid=b.shopid and
> a.shopposition=b.shopposition)
>
> and then join with C - select <<fields>> from D d join C c on (
> d.clientid=c.clientid) the results are as expected.
>
> For example if Table A has 8 rows then following the second approach gives
> me 8 rows with proper fields but the the former approach I get lot of rows
> ( probably a cartesian product)
>
> I am confused about whats going wrong with the first approach,could
> somebody throw some light please.
>
> ---------------------------
> Thanks & Regards
> Himanish
>