You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Ramasubramanian Narayanan <ra...@gmail.com> on 2017/08/23 13:37:38 UTC

LEFT JOIN and WHERE CLAUSE - How to handle

Hi,

Need your suggestion on the below.

Have two tables TXN and CURRENCY.

Need all records in TXN and hence doing Left Join with CURRENCY.

*Two problems :*
1. CURRENCY table may contain duplicate records hence it needs to be
handled through RANK or some other function.
2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
clause' then we will loose the EUR records which should not happen.

Please suggest a solution to over come both the problems. For duplicated
records it is fine if we select any of the CNTRY_DESC.

*Table : CURRENCY*

*Table : TXN*

*CCY_CD*

*CNTRY_DESC*

*EFF_ST_DT*

*EFF_END_DT*

*ROW_NUM*

*CCY_CD*

*TXN_DT*

INR

Indian Rupee

1-Jan-15

20-Feb-16

1

INR

16-Feb-17

INR

Indian Rupee New

21-Feb-16

20-Feb-99

2

USD

16-Feb-17

USD

US Dollar

1-Jan-15

20-Feb-16

3

SGD

16-Feb-17

SGD

Singapore Dollar

1-Jan-15

20-Feb-17

4

EUR

16-Feb-17

SGD

Singapore Dollar New

15-Feb-17

20-Feb-99

SGD

Singapore Dollar Latest

16-Feb-17

16-Feb-17

*Expected Output*

*ROW_NUM*

*CCY_CD*

*TXN_DT*

*CNTRY_DESC*

1

INR

16-Feb-17

Indian Rupee

2

USD

16-Feb-17

US Dollar

3

SGD

16-Feb-17

Singapore Dollar Latest (Any of three valid valid is fine)

4

EUR

16-Feb-17

<Null>



*Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
where
TXN_DT between EFF_ST_DT and EFF_END_DT;



This query will drop the "EUR" record because of the where clause used.
It cannot be handled with case statement instead of 'where clause' as we
have   more than one record for 'SGD' when  TXN_DT is 16-FEB.

regards,
Rams

Re: LEFT JOIN and WHERE CLAUSE - How to handle

Posted by Ramasubramanian Narayanan <ra...@gmail.com>.

Hi,

My bad, you are right I intended to give the query  as below (TXN left
joining CURRENCY not the other way around):

*Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
from TXN LEFT JOIN CURRENCY  on (CURRENCY.CCY_CD = TXN.CCY_CD)
where
TXN_DT between EFF_ST_DT and EFF_END_DT;

Thanks and regards,
Rams


On Wed, Aug 23, 2017 at 9:17 PM, Furcy Pin <fu...@flaminem.com> wrote:

> Ho, in that case...
>
> (First I notice that you say you want all records in TXN but in the query
> you give, you perform your join the other way round.)
>
> This is a typical use case that SQL is not very good at handling...
>
> The solutions I see are:
>
> - use RANK as you suggested.
>   Note that Hive is smart enough to optimize it correctly :
>   if you use a filter WHERE Rank() < K, it will take the K first values in
> RAM and ignore the rest, rather than ranking everything and filtering
> afterwards.
>
> - perform a GROUP BY TXN.ROW_NUM so you have only one line per
> transaction. Aggregate the rest the way you like (max, max_by or arbitrary)
>   To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
> to your where clause (and reverse the left join).
>
> - Most complicated way to implement, but simplest to use afterwards:
>   replace your CURRENCY table with a table where you have only one row per
> currency, and store the multiple start/end dates in an array of structs,
>   then a UDF that given such array and a date returns the correct
> description. It only works if you don't have too many description per
> currency though.
>   This is where Spark comes handy: it is simpler to define UDFs with it.
>
>
>
> On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
> ramasubramanian.narayanan@gmail.com> wrote:
>
>> Hi,
>>
>> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
>> It needs to be equated.
>>
>>
>> regards,
>> Rams
>>
>> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <fu...@flaminem.com>
>> wrote:
>>
>>> I would suggest to use a subquery
>>>
>>> WITH unique_currency AS (
>>>   SELECT
>>>     CCY_CD,
>>>     MAX(CNTRY_DESC) as CNTRY_DESC
>>>   FROM CURRENCY
>>>   GROUP BY CCY_CD
>>> )
>>>
>>> and then perform your left join on it.
>>>
>>> Some SQL engine (e.g. Presto) have aggregation functions like
>>> arbitrary(col) that take any value and are a little less costly than a
>>> max.
>>> Sometimes, they also have functions like max_by(x, y)
>>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>>> would allow you to get the most recent description.
>>>
>>> It is a shame that this function is not included in Hive yet, but still
>>> you can find some UDAF implementations on github
>>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>>> .
>>>
>>>
>>>
>>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>>> ramasubramanian.narayanan@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Need your suggestion on the below.
>>>>
>>>> Have two tables TXN and CURRENCY.
>>>>
>>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>>
>>>> *Two problems :*
>>>> 1. CURRENCY table may contain duplicate records hence it needs to be
>>>> handled through RANK or some other function.
>>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>>> clause' then we will loose the EUR records which should not happen.
>>>>
>>>> Please suggest a solution to over come both the problems. For
>>>> duplicated records it is fine if we select any of the CNTRY_DESC.
>>>>
>>>> *Table : CURRENCY*
>>>>
>>>> *Table : TXN*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> *EFF_ST_DT*
>>>>
>>>> *EFF_END_DT*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> INR
>>>>
>>>> Indian Rupee
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> INR
>>>>
>>>> Indian Rupee New
>>>>
>>>> 21-Feb-16
>>>>
>>>> 20-Feb-99
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> USD
>>>>
>>>> US Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-17
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar New
>>>>
>>>> 15-Feb-17
>>>>
>>>> 20-Feb-99
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar Latest
>>>>
>>>> 16-Feb-17
>>>>
>>>> 16-Feb-17
>>>>
>>>> *Expected Output*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> Indian Rupee
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> US Dollar
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> Singapore Dollar Latest (Any of three valid valid is fine)
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> <Null>
>>>>
>>>>
>>>>
>>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>>>> where
>>>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>>>
>>>>
>>>>
>>>> This query will drop the "EUR" record because of the where clause used.
>>>> It cannot be handled with case statement instead of 'where clause' as
>>>> we have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>>>>
>>>> regards,
>>>> Rams
>>>>
>>>
>>>
>>
>

Re: LEFT JOIN and WHERE CLAUSE - How to handle

Posted by peter zhang <pe...@gmail.com>.

How about splitting your txn data into two parts, one for the tx that has
currency info (just use join) and the other part for the tx that can't find
currency info then use a union all operator combines two parts tx, as below:

SELECT ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
  FROM CURRENCY JOIN TXN ON (CURRENCY.CCY_CD = TXN.CCY_CD)
 WHERE TXN_DT BETWEEN EFF_ST_DT AND EFF_END_DT;
 UNION ALL
SELECT ROW_NUM,CCY_CD,TXN_DT, null AS CNTRY_DESC
  FROM TXN
 WHERE NOT EXISTS (SELECT 1 FROM CURRENCY WHERE CURRENCY.CCY_CD =
TXN.CCY_CD)

2017-08-23 23:47 GMT+08:00 Furcy Pin <fu...@flaminem.com>:

> Ho, in that case...
>
> (First I notice that you say you want all records in TXN but in the query
> you give, you perform your join the other way round.)
>
> This is a typical use case that SQL is not very good at handling...
>
> The solutions I see are:
>
> - use RANK as you suggested.
>   Note that Hive is smart enough to optimize it correctly :
>   if you use a filter WHERE Rank() < K, it will take the K first values in
> RAM and ignore the rest, rather than ranking everything and filtering
> afterwards.
>
> - perform a GROUP BY TXN.ROW_NUM so you have only one line per
> transaction. Aggregate the rest the way you like (max, max_by or arbitrary)
>   To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
> to your where clause (and reverse the left join).
>
> - Most complicated way to implement, but simplest to use afterwards:
>   replace your CURRENCY table with a table where you have only one row per
> currency, and store the multiple start/end dates in an array of structs,
>   then a UDF that given such array and a date returns the correct
> description. It only works if you don't have too many description per
> currency though.
>   This is where Spark comes handy: it is simpler to define UDFs with it.
>
>
>
> On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
> ramasubramanian.narayanan@gmail.com> wrote:
>
>> Hi,
>>
>> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
>> It needs to be equated.
>>
>>
>> regards,
>> Rams
>>
>> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <fu...@flaminem.com>
>> wrote:
>>
>>> I would suggest to use a subquery
>>>
>>> WITH unique_currency AS (
>>>   SELECT
>>>     CCY_CD,
>>>     MAX(CNTRY_DESC) as CNTRY_DESC
>>>   FROM CURRENCY
>>>   GROUP BY CCY_CD
>>> )
>>>
>>> and then perform your left join on it.
>>>
>>> Some SQL engine (e.g. Presto) have aggregation functions like
>>> arbitrary(col) that take any value and are a little less costly than a
>>> max.
>>> Sometimes, they also have functions like max_by(x, y)
>>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>>> would allow you to get the most recent description.
>>>
>>> It is a shame that this function is not included in Hive yet, but still
>>> you can find some UDAF implementations on github
>>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>>> .
>>>
>>>
>>>
>>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>>> ramasubramanian.narayanan@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Need your suggestion on the below.
>>>>
>>>> Have two tables TXN and CURRENCY.
>>>>
>>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>>
>>>> *Two problems :*
>>>> 1. CURRENCY table may contain duplicate records hence it needs to be
>>>> handled through RANK or some other function.
>>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>>> clause' then we will loose the EUR records which should not happen.
>>>>
>>>> Please suggest a solution to over come both the problems. For
>>>> duplicated records it is fine if we select any of the CNTRY_DESC.
>>>>
>>>> *Table : CURRENCY*
>>>>
>>>> *Table : TXN*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> *EFF_ST_DT*
>>>>
>>>> *EFF_END_DT*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> INR
>>>>
>>>> Indian Rupee
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> INR
>>>>
>>>> Indian Rupee New
>>>>
>>>> 21-Feb-16
>>>>
>>>> 20-Feb-99
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> USD
>>>>
>>>> US Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-17
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar New
>>>>
>>>> 15-Feb-17
>>>>
>>>> 20-Feb-99
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar Latest
>>>>
>>>> 16-Feb-17
>>>>
>>>> 16-Feb-17
>>>>
>>>> *Expected Output*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> Indian Rupee
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> US Dollar
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> Singapore Dollar Latest (Any of three valid valid is fine)
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> <Null>
>>>>
>>>>
>>>>
>>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>>>> where
>>>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>>>
>>>>
>>>>
>>>> This query will drop the "EUR" record because of the where clause used.
>>>> It cannot be handled with case statement instead of 'where clause' as
>>>> we have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>>>>
>>>> regards,
>>>> Rams
>>>>
>>>
>>>
>>
>

Re: LEFT JOIN and WHERE CLAUSE - How to handle

Posted by Furcy Pin <fu...@flaminem.com>.

Ho, in that case...

(First I notice that you say you want all records in TXN but in the query
you give, you perform your join the other way round.)

This is a typical use case that SQL is not very good at handling...

The solutions I see are:

- use RANK as you suggested.
  Note that Hive is smart enough to optimize it correctly :
  if you use a filter WHERE Rank() < K, it will take the K first values in
RAM and ignore the rest, rather than ranking everything and filtering
afterwards.

- perform a GROUP BY TXN.ROW_NUM so you have only one line per transaction.
Aggregate the rest the way you like (max, max_by or arbitrary)
  To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
to your where clause (and reverse the left join).

- Most complicated way to implement, but simplest to use afterwards:
  replace your CURRENCY table with a table where you have only one row per
currency, and store the multiple start/end dates in an array of structs,
  then a UDF that given such array and a date returns the correct
description. It only works if you don't have too many description per
currency though.
  This is where Spark comes handy: it is simpler to define UDFs with it.



On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
ramasubramanian.narayanan@gmail.com> wrote:

> Hi,
>
> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
> It needs to be equated.
>
>
> regards,
> Rams
>
> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <fu...@flaminem.com> wrote:
>
>> I would suggest to use a subquery
>>
>> WITH unique_currency AS (
>>   SELECT
>>     CCY_CD,
>>     MAX(CNTRY_DESC) as CNTRY_DESC
>>   FROM CURRENCY
>>   GROUP BY CCY_CD
>> )
>>
>> and then perform your left join on it.
>>
>> Some SQL engine (e.g. Presto) have aggregation functions like
>> arbitrary(col) that take any value and are a little less costly than a
>> max.
>> Sometimes, they also have functions like max_by(x, y)
>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>> would allow you to get the most recent description.
>>
>> It is a shame that this function is not included in Hive yet, but still
>> you can find some UDAF implementations on github
>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>> .
>>
>>
>>
>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>> ramasubramanian.narayanan@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Need your suggestion on the below.
>>>
>>> Have two tables TXN and CURRENCY.
>>>
>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>
>>> *Two problems :*
>>> 1. CURRENCY table may contain duplicate records hence it needs to be
>>> handled through RANK or some other function.
>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>> clause' then we will loose the EUR records which should not happen.
>>>
>>> Please suggest a solution to over come both the problems. For duplicated
>>> records it is fine if we select any of the CNTRY_DESC.
>>>
>>> *Table : CURRENCY*
>>>
>>> *Table : TXN*
>>>
>>> *CCY_CD*
>>>
>>> *CNTRY_DESC*
>>>
>>> *EFF_ST_DT*
>>>
>>> *EFF_END_DT*
>>>
>>> *ROW_NUM*
>>>
>>> *CCY_CD*
>>>
>>> *TXN_DT*
>>>
>>> INR
>>>
>>> Indian Rupee
>>>
>>> 1-Jan-15
>>>
>>> 20-Feb-16
>>>
>>> 1
>>>
>>> INR
>>>
>>> 16-Feb-17
>>>
>>> INR
>>>
>>> Indian Rupee New
>>>
>>> 21-Feb-16
>>>
>>> 20-Feb-99
>>>
>>> 2
>>>
>>> USD
>>>
>>> 16-Feb-17
>>>
>>> USD
>>>
>>> US Dollar
>>>
>>> 1-Jan-15
>>>
>>> 20-Feb-16
>>>
>>> 3
>>>
>>> SGD
>>>
>>> 16-Feb-17
>>>
>>> SGD
>>>
>>> Singapore Dollar
>>>
>>> 1-Jan-15
>>>
>>> 20-Feb-17
>>>
>>> 4
>>>
>>> EUR
>>>
>>> 16-Feb-17
>>>
>>> SGD
>>>
>>> Singapore Dollar New
>>>
>>> 15-Feb-17
>>>
>>> 20-Feb-99
>>>
>>> SGD
>>>
>>> Singapore Dollar Latest
>>>
>>> 16-Feb-17
>>>
>>> 16-Feb-17
>>>
>>> *Expected Output*
>>>
>>> *ROW_NUM*
>>>
>>> *CCY_CD*
>>>
>>> *TXN_DT*
>>>
>>> *CNTRY_DESC*
>>>
>>> 1
>>>
>>> INR
>>>
>>> 16-Feb-17
>>>
>>> Indian Rupee
>>>
>>> 2
>>>
>>> USD
>>>
>>> 16-Feb-17
>>>
>>> US Dollar
>>>
>>> 3
>>>
>>> SGD
>>>
>>> 16-Feb-17
>>>
>>> Singapore Dollar Latest (Any of three valid valid is fine)
>>>
>>> 4
>>>
>>> EUR
>>>
>>> 16-Feb-17
>>>
>>> <Null>
>>>
>>>
>>>
>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>>> where
>>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>>
>>>
>>>
>>> This query will drop the "EUR" record because of the where clause used.
>>> It cannot be handled with case statement instead of 'where clause' as we
>>> have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>>>
>>> regards,
>>> Rams
>>>
>>
>>
>

Re: LEFT JOIN and WHERE CLAUSE - How to handle

Posted by Ramasubramanian Narayanan <ra...@gmail.com>.

Hi,

TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT. It
needs to be equated.


regards,
Rams

On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <fu...@flaminem.com> wrote:

> I would suggest to use a subquery
>
> WITH unique_currency AS (
>   SELECT
>     CCY_CD,
>     MAX(CNTRY_DESC) as CNTRY_DESC
>   FROM CURRENCY
>   GROUP BY CCY_CD
> )
>
> and then perform your left join on it.
>
> Some SQL engine (e.g. Presto) have aggregation functions like
> arbitrary(col) that take any value and are a little less costly than a
> max.
> Sometimes, they also have functions like max_by(x, y)
> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
> would allow you to get the most recent description.
>
> It is a shame that this function is not included in Hive yet, but still
> you can find some UDAF implementations on github
> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
> .
>
>
>
> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
> ramasubramanian.narayanan@gmail.com> wrote:
>
>> Hi,
>>
>> Need your suggestion on the below.
>>
>> Have two tables TXN and CURRENCY.
>>
>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>
>> *Two problems :*
>> 1. CURRENCY table may contain duplicate records hence it needs to be
>> handled through RANK or some other function.
>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>> clause' then we will loose the EUR records which should not happen.
>>
>> Please suggest a solution to over come both the problems. For duplicated
>> records it is fine if we select any of the CNTRY_DESC.
>>
>> *Table : CURRENCY*
>>
>> *Table : TXN*
>>
>> *CCY_CD*
>>
>> *CNTRY_DESC*
>>
>> *EFF_ST_DT*
>>
>> *EFF_END_DT*
>>
>> *ROW_NUM*
>>
>> *CCY_CD*
>>
>> *TXN_DT*
>>
>> INR
>>
>> Indian Rupee
>>
>> 1-Jan-15
>>
>> 20-Feb-16
>>
>> 1
>>
>> INR
>>
>> 16-Feb-17
>>
>> INR
>>
>> Indian Rupee New
>>
>> 21-Feb-16
>>
>> 20-Feb-99
>>
>> 2
>>
>> USD
>>
>> 16-Feb-17
>>
>> USD
>>
>> US Dollar
>>
>> 1-Jan-15
>>
>> 20-Feb-16
>>
>> 3
>>
>> SGD
>>
>> 16-Feb-17
>>
>> SGD
>>
>> Singapore Dollar
>>
>> 1-Jan-15
>>
>> 20-Feb-17
>>
>> 4
>>
>> EUR
>>
>> 16-Feb-17
>>
>> SGD
>>
>> Singapore Dollar New
>>
>> 15-Feb-17
>>
>> 20-Feb-99
>>
>> SGD
>>
>> Singapore Dollar Latest
>>
>> 16-Feb-17
>>
>> 16-Feb-17
>>
>> *Expected Output*
>>
>> *ROW_NUM*
>>
>> *CCY_CD*
>>
>> *TXN_DT*
>>
>> *CNTRY_DESC*
>>
>> 1
>>
>> INR
>>
>> 16-Feb-17
>>
>> Indian Rupee
>>
>> 2
>>
>> USD
>>
>> 16-Feb-17
>>
>> US Dollar
>>
>> 3
>>
>> SGD
>>
>> 16-Feb-17
>>
>> Singapore Dollar Latest (Any of three valid valid is fine)
>>
>> 4
>>
>> EUR
>>
>> 16-Feb-17
>>
>> <Null>
>>
>>
>>
>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>> where
>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>
>>
>>
>> This query will drop the "EUR" record because of the where clause used.
>> It cannot be handled with case statement instead of 'where clause' as we
>> have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>>
>> regards,
>> Rams
>>
>
>

Re: LEFT JOIN and WHERE CLAUSE - How to handle

Posted by Furcy Pin <fu...@flaminem.com>.

I would suggest to use a subquery

WITH unique_currency AS (
  SELECT
    CCY_CD,
    MAX(CNTRY_DESC) as CNTRY_DESC
  FROM CURRENCY
  GROUP BY CCY_CD
)

and then perform your left join on it.

Some SQL engine (e.g. Presto) have aggregation functions like arbitrary(col)
that take any value and are a little less costly than a max.
Sometimes, they also have functions like max_by(x, y)
<https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
would allow you to get the most recent description.

It is a shame that this function is not included in Hive yet, but still you
can find some UDAF implementations on github
<https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>.



On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
ramasubramanian.narayanan@gmail.com> wrote:

> Hi,
>
> Need your suggestion on the below.
>
> Have two tables TXN and CURRENCY.
>
> Need all records in TXN and hence doing Left Join with CURRENCY.
>
> *Two problems :*
> 1. CURRENCY table may contain duplicate records hence it needs to be
> handled through RANK or some other function.
> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
> clause' then we will loose the EUR records which should not happen.
>
> Please suggest a solution to over come both the problems. For duplicated
> records it is fine if we select any of the CNTRY_DESC.
>
> *Table : CURRENCY*
>
> *Table : TXN*
>
> *CCY_CD*
>
> *CNTRY_DESC*
>
> *EFF_ST_DT*
>
> *EFF_END_DT*
>
> *ROW_NUM*
>
> *CCY_CD*
>
> *TXN_DT*
>
> INR
>
> Indian Rupee
>
> 1-Jan-15
>
> 20-Feb-16
>
> 1
>
> INR
>
> 16-Feb-17
>
> INR
>
> Indian Rupee New
>
> 21-Feb-16
>
> 20-Feb-99
>
> 2
>
> USD
>
> 16-Feb-17
>
> USD
>
> US Dollar
>
> 1-Jan-15
>
> 20-Feb-16
>
> 3
>
> SGD
>
> 16-Feb-17
>
> SGD
>
> Singapore Dollar
>
> 1-Jan-15
>
> 20-Feb-17
>
> 4
>
> EUR
>
> 16-Feb-17
>
> SGD
>
> Singapore Dollar New
>
> 15-Feb-17
>
> 20-Feb-99
>
> SGD
>
> Singapore Dollar Latest
>
> 16-Feb-17
>
> 16-Feb-17
>
> *Expected Output*
>
> *ROW_NUM*
>
> *CCY_CD*
>
> *TXN_DT*
>
> *CNTRY_DESC*
>
> 1
>
> INR
>
> 16-Feb-17
>
> Indian Rupee
>
> 2
>
> USD
>
> 16-Feb-17
>
> US Dollar
>
> 3
>
> SGD
>
> 16-Feb-17
>
> Singapore Dollar Latest (Any of three valid valid is fine)
>
> 4
>
> EUR
>
> 16-Feb-17
>
> <Null>
>
>
>
> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
> where
> TXN_DT between EFF_ST_DT and EFF_END_DT;
>
>
>
> This query will drop the "EUR" record because of the where clause used.
> It cannot be handled with case statement instead of 'where clause' as we
> have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>
> regards,
> Rams
>