You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Ramasubramanian Narayanan <ra...@gmail.com> on 2017/08/23 13:37:38 UTC
LEFT JOIN and WHERE CLAUSE - How to handle
Hi,
Need your suggestion on the below.
Have two tables TXN and CURRENCY.
Need all records in TXN and hence doing Left Join with CURRENCY.
*Two problems :*
1. CURRENCY table may contain duplicate records hence it needs to be
handled through RANK or some other function.
2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
clause' then we will loose the EUR records which should not happen.
Please suggest a solution to over come both the problems. For duplicated
records it is fine if we select any of the CNTRY_DESC.
*Table : CURRENCY*
*Table : TXN*
*CCY_CD*
*CNTRY_DESC*
*EFF_ST_DT*
*EFF_END_DT*
*ROW_NUM*
*CCY_CD*
*TXN_DT*
INR
Indian Rupee
1-Jan-15
20-Feb-16
1
INR
16-Feb-17
INR
Indian Rupee New
21-Feb-16
20-Feb-99
2
USD
16-Feb-17
USD
US Dollar
1-Jan-15
20-Feb-16
3
SGD
16-Feb-17
SGD
Singapore Dollar
1-Jan-15
20-Feb-17
4
EUR
16-Feb-17
SGD
Singapore Dollar New
15-Feb-17
20-Feb-99
SGD
Singapore Dollar Latest
16-Feb-17
16-Feb-17
*Expected Output*
*ROW_NUM*
*CCY_CD*
*TXN_DT*
*CNTRY_DESC*
1
INR
16-Feb-17
Indian Rupee
2
USD
16-Feb-17
US Dollar
3
SGD
16-Feb-17
Singapore Dollar Latest (Any of three valid valid is fine)
4
EUR
16-Feb-17
<Null>
*Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
where
TXN_DT between EFF_ST_DT and EFF_END_DT;
This query will drop the "EUR" record because of the where clause used.
It cannot be handled with case statement instead of 'where clause' as we
have more than one record for 'SGD' when TXN_DT is 16-FEB.
regards,
Rams
Re: LEFT JOIN and WHERE CLAUSE - How to handle
Posted by Ramasubramanian Narayanan <ra...@gmail.com>.
Hi,
My bad, you are right I intended to give the query as below (TXN left
joining CURRENCY not the other way around):
*Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
from TXN LEFT JOIN CURRENCY on (CURRENCY.CCY_CD = TXN.CCY_CD)
where
TXN_DT between EFF_ST_DT and EFF_END_DT;
Thanks and regards,
Rams
On Wed, Aug 23, 2017 at 9:17 PM, Furcy Pin <fu...@flaminem.com> wrote:
> Ho, in that case...
>
> (First I notice that you say you want all records in TXN but in the query
> you give, you perform your join the other way round.)
>
> This is a typical use case that SQL is not very good at handling...
>
> The solutions I see are:
>
> - use RANK as you suggested.
> Note that Hive is smart enough to optimize it correctly :
> if you use a filter WHERE Rank() < K, it will take the K first values in
> RAM and ignore the rest, rather than ranking everything and filtering
> afterwards.
>
> - perform a GROUP BY TXN.ROW_NUM so you have only one line per
> transaction. Aggregate the rest the way you like (max, max_by or arbitrary)
> To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
> to your where clause (and reverse the left join).
>
> - Most complicated way to implement, but simplest to use afterwards:
> replace your CURRENCY table with a table where you have only one row per
> currency, and store the multiple start/end dates in an array of structs,
> then a UDF that given such array and a date returns the correct
> description. It only works if you don't have too many description per
> currency though.
> This is where Spark comes handy: it is simpler to define UDFs with it.
>
>
>
> On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
> ramasubramanian.narayanan@gmail.com> wrote:
>
>> Hi,
>>
>> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
>> It needs to be equated.
>>
>>
>> regards,
>> Rams
>>
>> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <fu...@flaminem.com>
>> wrote:
>>
>>> I would suggest to use a subquery
>>>
>>> WITH unique_currency AS (
>>> SELECT
>>> CCY_CD,
>>> MAX(CNTRY_DESC) as CNTRY_DESC
>>> FROM CURRENCY
>>> GROUP BY CCY_CD
>>> )
>>>
>>> and then perform your left join on it.
>>>
>>> Some SQL engine (e.g. Presto) have aggregation functions like
>>> arbitrary(col) that take any value and are a little less costly than a
>>> max.
>>> Sometimes, they also have functions like max_by(x, y)
>>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>>> would allow you to get the most recent description.
>>>
>>> It is a shame that this function is not included in Hive yet, but still
>>> you can find some UDAF implementations on github
>>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>>> .
>>>
>>>
>>>
>>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>>> ramasubramanian.narayanan@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Need your suggestion on the below.
>>>>
>>>> Have two tables TXN and CURRENCY.
>>>>
>>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>>
>>>> *Two problems :*
>>>> 1. CURRENCY table may contain duplicate records hence it needs to be
>>>> handled through RANK or some other function.
>>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>>> clause' then we will loose the EUR records which should not happen.
>>>>
>>>> Please suggest a solution to over come both the problems. For
>>>> duplicated records it is fine if we select any of the CNTRY_DESC.
>>>>
>>>> *Table : CURRENCY*
>>>>
>>>> *Table : TXN*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> *EFF_ST_DT*
>>>>
>>>> *EFF_END_DT*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> INR
>>>>
>>>> Indian Rupee
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> INR
>>>>
>>>> Indian Rupee New
>>>>
>>>> 21-Feb-16
>>>>
>>>> 20-Feb-99
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> USD
>>>>
>>>> US Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-17
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar New
>>>>
>>>> 15-Feb-17
>>>>
>>>> 20-Feb-99
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar Latest
>>>>
>>>> 16-Feb-17
>>>>
>>>> 16-Feb-17
>>>>
>>>> *Expected Output*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> Indian Rupee
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> US Dollar
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> Singapore Dollar Latest (Any of three valid valid is fine)
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> <Null>
>>>>
>>>>
>>>>
>>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>>>> where
>>>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>>>
>>>>
>>>>
>>>> This query will drop the "EUR" record because of the where clause used.
>>>> It cannot be handled with case statement instead of 'where clause' as
>>>> we have more than one record for 'SGD' when TXN_DT is 16-FEB.
>>>>
>>>> regards,
>>>> Rams
>>>>
>>>
>>>
>>
>
Re: LEFT JOIN and WHERE CLAUSE - How to handle
Posted by peter zhang <pe...@gmail.com>.
How about splitting your txn data into two parts, one for the tx that has
currency info (just use join) and the other part for the tx that can't find
currency info then use a union all operator combines two parts tx, as below:
SELECT ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
FROM CURRENCY JOIN TXN ON (CURRENCY.CCY_CD = TXN.CCY_CD)
WHERE TXN_DT BETWEEN EFF_ST_DT AND EFF_END_DT;
UNION ALL
SELECT ROW_NUM,CCY_CD,TXN_DT, null AS CNTRY_DESC
FROM TXN
WHERE NOT EXISTS (SELECT 1 FROM CURRENCY WHERE CURRENCY.CCY_CD =
TXN.CCY_CD)
2017-08-23 23:47 GMT+08:00 Furcy Pin <fu...@flaminem.com>:
> Ho, in that case...
>
> (First I notice that you say you want all records in TXN but in the query
> you give, you perform your join the other way round.)
>
> This is a typical use case that SQL is not very good at handling...
>
> The solutions I see are:
>
> - use RANK as you suggested.
> Note that Hive is smart enough to optimize it correctly :
> if you use a filter WHERE Rank() < K, it will take the K first values in
> RAM and ignore the rest, rather than ranking everything and filtering
> afterwards.
>
> - perform a GROUP BY TXN.ROW_NUM so you have only one line per
> transaction. Aggregate the rest the way you like (max, max_by or arbitrary)
> To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
> to your where clause (and reverse the left join).
>
> - Most complicated way to implement, but simplest to use afterwards:
> replace your CURRENCY table with a table where you have only one row per
> currency, and store the multiple start/end dates in an array of structs,
> then a UDF that given such array and a date returns the correct
> description. It only works if you don't have too many description per
> currency though.
> This is where Spark comes handy: it is simpler to define UDFs with it.
>
>
>
> On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
> ramasubramanian.narayanan@gmail.com> wrote:
>
>> Hi,
>>
>> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
>> It needs to be equated.
>>
>>
>> regards,
>> Rams
>>
>> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <fu...@flaminem.com>
>> wrote:
>>
>>> I would suggest to use a subquery
>>>
>>> WITH unique_currency AS (
>>> SELECT
>>> CCY_CD,
>>> MAX(CNTRY_DESC) as CNTRY_DESC
>>> FROM CURRENCY
>>> GROUP BY CCY_CD
>>> )
>>>
>>> and then perform your left join on it.
>>>
>>> Some SQL engine (e.g. Presto) have aggregation functions like
>>> arbitrary(col) that take any value and are a little less costly than a
>>> max.
>>> Sometimes, they also have functions like max_by(x, y)
>>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>>> would allow you to get the most recent description.
>>>
>>> It is a shame that this function is not included in Hive yet, but still
>>> you can find some UDAF implementations on github
>>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>>> .
>>>
>>>
>>>
>>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>>> ramasubramanian.narayanan@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Need your suggestion on the below.
>>>>
>>>> Have two tables TXN and CURRENCY.
>>>>
>>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>>
>>>> *Two problems :*
>>>> 1. CURRENCY table may contain duplicate records hence it needs to be
>>>> handled through RANK or some other function.
>>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>>> clause' then we will loose the EUR records which should not happen.
>>>>
>>>> Please suggest a solution to over come both the problems. For
>>>> duplicated records it is fine if we select any of the CNTRY_DESC.
>>>>
>>>> *Table : CURRENCY*
>>>>
>>>> *Table : TXN*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> *EFF_ST_DT*
>>>>
>>>> *EFF_END_DT*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> INR
>>>>
>>>> Indian Rupee
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> INR
>>>>
>>>> Indian Rupee New
>>>>
>>>> 21-Feb-16
>>>>
>>>> 20-Feb-99
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> USD
>>>>
>>>> US Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-17
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar New
>>>>
>>>> 15-Feb-17
>>>>
>>>> 20-Feb-99
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar Latest
>>>>
>>>> 16-Feb-17
>>>>
>>>> 16-Feb-17
>>>>
>>>> *Expected Output*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> Indian Rupee
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> US Dollar
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> Singapore Dollar Latest (Any of three valid valid is fine)
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> <Null>
>>>>
>>>>
>>>>
>>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>>>> where
>>>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>>>
>>>>
>>>>
>>>> This query will drop the "EUR" record because of the where clause used.
>>>> It cannot be handled with case statement instead of 'where clause' as
>>>> we have more than one record for 'SGD' when TXN_DT is 16-FEB.
>>>>
>>>> regards,
>>>> Rams
>>>>
>>>
>>>
>>
>
Re: LEFT JOIN and WHERE CLAUSE - How to handle
Posted by Furcy Pin <fu...@flaminem.com>.
Ho, in that case...
(First I notice that you say you want all records in TXN but in the query
you give, you perform your join the other way round.)
This is a typical use case that SQL is not very good at handling...
The solutions I see are:
- use RANK as you suggested.
Note that Hive is smart enough to optimize it correctly :
if you use a filter WHERE Rank() < K, it will take the K first values in
RAM and ignore the rest, rather than ranking everything and filtering
afterwards.
- perform a GROUP BY TXN.ROW_NUM so you have only one line per transaction.
Aggregate the rest the way you like (max, max_by or arbitrary)
To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
to your where clause (and reverse the left join).
- Most complicated way to implement, but simplest to use afterwards:
replace your CURRENCY table with a table where you have only one row per
currency, and store the multiple start/end dates in an array of structs,
then a UDF that given such array and a date returns the correct
description. It only works if you don't have too many description per
currency though.
This is where Spark comes handy: it is simpler to define UDFs with it.
On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
ramasubramanian.narayanan@gmail.com> wrote:
> Hi,
>
> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
> It needs to be equated.
>
>
> regards,
> Rams
>
> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <fu...@flaminem.com> wrote:
>
>> I would suggest to use a subquery
>>
>> WITH unique_currency AS (
>> SELECT
>> CCY_CD,
>> MAX(CNTRY_DESC) as CNTRY_DESC
>> FROM CURRENCY
>> GROUP BY CCY_CD
>> )
>>
>> and then perform your left join on it.
>>
>> Some SQL engine (e.g. Presto) have aggregation functions like
>> arbitrary(col) that take any value and are a little less costly than a
>> max.
>> Sometimes, they also have functions like max_by(x, y)
>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>> would allow you to get the most recent description.
>>
>> It is a shame that this function is not included in Hive yet, but still
>> you can find some UDAF implementations on github
>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>> .
>>
>>
>>
>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>> ramasubramanian.narayanan@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Need your suggestion on the below.
>>>
>>> Have two tables TXN and CURRENCY.
>>>
>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>
>>> *Two problems :*
>>> 1. CURRENCY table may contain duplicate records hence it needs to be
>>> handled through RANK or some other function.
>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>> clause' then we will loose the EUR records which should not happen.
>>>
>>> Please suggest a solution to over come both the problems. For duplicated
>>> records it is fine if we select any of the CNTRY_DESC.
>>>
>>> *Table : CURRENCY*
>>>
>>> *Table : TXN*
>>>
>>> *CCY_CD*
>>>
>>> *CNTRY_DESC*
>>>
>>> *EFF_ST_DT*
>>>
>>> *EFF_END_DT*
>>>
>>> *ROW_NUM*
>>>
>>> *CCY_CD*
>>>
>>> *TXN_DT*
>>>
>>> INR
>>>
>>> Indian Rupee
>>>
>>> 1-Jan-15
>>>
>>> 20-Feb-16
>>>
>>> 1
>>>
>>> INR
>>>
>>> 16-Feb-17
>>>
>>> INR
>>>
>>> Indian Rupee New
>>>
>>> 21-Feb-16
>>>
>>> 20-Feb-99
>>>
>>> 2
>>>
>>> USD
>>>
>>> 16-Feb-17
>>>
>>> USD
>>>
>>> US Dollar
>>>
>>> 1-Jan-15
>>>
>>> 20-Feb-16
>>>
>>> 3
>>>
>>> SGD
>>>
>>> 16-Feb-17
>>>
>>> SGD
>>>
>>> Singapore Dollar
>>>
>>> 1-Jan-15
>>>
>>> 20-Feb-17
>>>
>>> 4
>>>
>>> EUR
>>>
>>> 16-Feb-17
>>>
>>> SGD
>>>
>>> Singapore Dollar New
>>>
>>> 15-Feb-17
>>>
>>> 20-Feb-99
>>>
>>> SGD
>>>
>>> Singapore Dollar Latest
>>>
>>> 16-Feb-17
>>>
>>> 16-Feb-17
>>>
>>> *Expected Output*
>>>
>>> *ROW_NUM*
>>>
>>> *CCY_CD*
>>>
>>> *TXN_DT*
>>>
>>> *CNTRY_DESC*
>>>
>>> 1
>>>
>>> INR
>>>
>>> 16-Feb-17
>>>
>>> Indian Rupee
>>>
>>> 2
>>>
>>> USD
>>>
>>> 16-Feb-17
>>>
>>> US Dollar
>>>
>>> 3
>>>
>>> SGD
>>>
>>> 16-Feb-17
>>>
>>> Singapore Dollar Latest (Any of three valid valid is fine)
>>>
>>> 4
>>>
>>> EUR
>>>
>>> 16-Feb-17
>>>
>>> <Null>
>>>
>>>
>>>
>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>>> where
>>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>>
>>>
>>>
>>> This query will drop the "EUR" record because of the where clause used.
>>> It cannot be handled with case statement instead of 'where clause' as we
>>> have more than one record for 'SGD' when TXN_DT is 16-FEB.
>>>
>>> regards,
>>> Rams
>>>
>>
>>
>
Re: LEFT JOIN and WHERE CLAUSE - How to handle
Posted by Ramasubramanian Narayanan <ra...@gmail.com>.
Hi,
TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT. It
needs to be equated.
regards,
Rams
On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <fu...@flaminem.com> wrote:
> I would suggest to use a subquery
>
> WITH unique_currency AS (
> SELECT
> CCY_CD,
> MAX(CNTRY_DESC) as CNTRY_DESC
> FROM CURRENCY
> GROUP BY CCY_CD
> )
>
> and then perform your left join on it.
>
> Some SQL engine (e.g. Presto) have aggregation functions like
> arbitrary(col) that take any value and are a little less costly than a
> max.
> Sometimes, they also have functions like max_by(x, y)
> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
> would allow you to get the most recent description.
>
> It is a shame that this function is not included in Hive yet, but still
> you can find some UDAF implementations on github
> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
> .
>
>
>
> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
> ramasubramanian.narayanan@gmail.com> wrote:
>
>> Hi,
>>
>> Need your suggestion on the below.
>>
>> Have two tables TXN and CURRENCY.
>>
>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>
>> *Two problems :*
>> 1. CURRENCY table may contain duplicate records hence it needs to be
>> handled through RANK or some other function.
>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>> clause' then we will loose the EUR records which should not happen.
>>
>> Please suggest a solution to over come both the problems. For duplicated
>> records it is fine if we select any of the CNTRY_DESC.
>>
>> *Table : CURRENCY*
>>
>> *Table : TXN*
>>
>> *CCY_CD*
>>
>> *CNTRY_DESC*
>>
>> *EFF_ST_DT*
>>
>> *EFF_END_DT*
>>
>> *ROW_NUM*
>>
>> *CCY_CD*
>>
>> *TXN_DT*
>>
>> INR
>>
>> Indian Rupee
>>
>> 1-Jan-15
>>
>> 20-Feb-16
>>
>> 1
>>
>> INR
>>
>> 16-Feb-17
>>
>> INR
>>
>> Indian Rupee New
>>
>> 21-Feb-16
>>
>> 20-Feb-99
>>
>> 2
>>
>> USD
>>
>> 16-Feb-17
>>
>> USD
>>
>> US Dollar
>>
>> 1-Jan-15
>>
>> 20-Feb-16
>>
>> 3
>>
>> SGD
>>
>> 16-Feb-17
>>
>> SGD
>>
>> Singapore Dollar
>>
>> 1-Jan-15
>>
>> 20-Feb-17
>>
>> 4
>>
>> EUR
>>
>> 16-Feb-17
>>
>> SGD
>>
>> Singapore Dollar New
>>
>> 15-Feb-17
>>
>> 20-Feb-99
>>
>> SGD
>>
>> Singapore Dollar Latest
>>
>> 16-Feb-17
>>
>> 16-Feb-17
>>
>> *Expected Output*
>>
>> *ROW_NUM*
>>
>> *CCY_CD*
>>
>> *TXN_DT*
>>
>> *CNTRY_DESC*
>>
>> 1
>>
>> INR
>>
>> 16-Feb-17
>>
>> Indian Rupee
>>
>> 2
>>
>> USD
>>
>> 16-Feb-17
>>
>> US Dollar
>>
>> 3
>>
>> SGD
>>
>> 16-Feb-17
>>
>> Singapore Dollar Latest (Any of three valid valid is fine)
>>
>> 4
>>
>> EUR
>>
>> 16-Feb-17
>>
>> <Null>
>>
>>
>>
>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>> where
>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>
>>
>>
>> This query will drop the "EUR" record because of the where clause used.
>> It cannot be handled with case statement instead of 'where clause' as we
>> have more than one record for 'SGD' when TXN_DT is 16-FEB.
>>
>> regards,
>> Rams
>>
>
>
Re: LEFT JOIN and WHERE CLAUSE - How to handle
Posted by Furcy Pin <fu...@flaminem.com>.
I would suggest to use a subquery
WITH unique_currency AS (
SELECT
CCY_CD,
MAX(CNTRY_DESC) as CNTRY_DESC
FROM CURRENCY
GROUP BY CCY_CD
)
and then perform your left join on it.
Some SQL engine (e.g. Presto) have aggregation functions like arbitrary(col)
that take any value and are a little less costly than a max.
Sometimes, they also have functions like max_by(x, y)
<https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
would allow you to get the most recent description.
It is a shame that this function is not included in Hive yet, but still you
can find some UDAF implementations on github
<https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>.
On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
ramasubramanian.narayanan@gmail.com> wrote:
> Hi,
>
> Need your suggestion on the below.
>
> Have two tables TXN and CURRENCY.
>
> Need all records in TXN and hence doing Left Join with CURRENCY.
>
> *Two problems :*
> 1. CURRENCY table may contain duplicate records hence it needs to be
> handled through RANK or some other function.
> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
> clause' then we will loose the EUR records which should not happen.
>
> Please suggest a solution to over come both the problems. For duplicated
> records it is fine if we select any of the CNTRY_DESC.
>
> *Table : CURRENCY*
>
> *Table : TXN*
>
> *CCY_CD*
>
> *CNTRY_DESC*
>
> *EFF_ST_DT*
>
> *EFF_END_DT*
>
> *ROW_NUM*
>
> *CCY_CD*
>
> *TXN_DT*
>
> INR
>
> Indian Rupee
>
> 1-Jan-15
>
> 20-Feb-16
>
> 1
>
> INR
>
> 16-Feb-17
>
> INR
>
> Indian Rupee New
>
> 21-Feb-16
>
> 20-Feb-99
>
> 2
>
> USD
>
> 16-Feb-17
>
> USD
>
> US Dollar
>
> 1-Jan-15
>
> 20-Feb-16
>
> 3
>
> SGD
>
> 16-Feb-17
>
> SGD
>
> Singapore Dollar
>
> 1-Jan-15
>
> 20-Feb-17
>
> 4
>
> EUR
>
> 16-Feb-17
>
> SGD
>
> Singapore Dollar New
>
> 15-Feb-17
>
> 20-Feb-99
>
> SGD
>
> Singapore Dollar Latest
>
> 16-Feb-17
>
> 16-Feb-17
>
> *Expected Output*
>
> *ROW_NUM*
>
> *CCY_CD*
>
> *TXN_DT*
>
> *CNTRY_DESC*
>
> 1
>
> INR
>
> 16-Feb-17
>
> Indian Rupee
>
> 2
>
> USD
>
> 16-Feb-17
>
> US Dollar
>
> 3
>
> SGD
>
> 16-Feb-17
>
> Singapore Dollar Latest (Any of three valid valid is fine)
>
> 4
>
> EUR
>
> 16-Feb-17
>
> <Null>
>
>
>
> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
> where
> TXN_DT between EFF_ST_DT and EFF_END_DT;
>
>
>
> This query will drop the "EUR" record because of the where clause used.
> It cannot be handled with case statement instead of 'where clause' as we
> have more than one record for 'SGD' when TXN_DT is 16-FEB.
>
> regards,
> Rams
>