You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by "Ryan C. Kleck" <rk...@godaddy.com> on 2018/02/02 21:59:51 UTC
Joining Integer To String Can Produce Duplicates
Hey all, I’m on Hive 1.2.2 at work and I found some unfavorable behavior on one of my joins and I wanted to see what you all think. Below is an example:
https://github.secureserver.net/gist/rkleck/258a9a7b3dd3c915f94a53234e422a1a
WITH string_key_setup AS (
SELECT CAST('1234 ' AS STRING) AS my_key
UNION ALL
SELECT CAST('1234' AS STRING) AS my_key
)
, group_setup AS (
SELECT my_key AS my_key
FROM string_key_setup
GROUP BY my_key
)
, string_key AS (
SELECT CAST('1234' AS STRING) AS my_key
)
, integer_key AS (
SELECT CAST('1234' AS INT) AS my_key
)
SELECT 'String To String Join' AS join_type
, COUNT(1) AS num_rows
FROM string_key t1
JOIN group_setup t2
ON t1.my_key = t2.my_key
UNION ALL
SELECT 'Integer To String Join' AS join_type
, COUNT(1) AS num_rows
FROM integer_key t1
JOIN group_setup t2
ON t1.my_key = t2.my_key
;
This query returns the following:
join_type
num_rows
Integer To String Join
2
String To String Join
1
I feel it’s unfavorable because the GROUP BY is not TRIMming the extra space around the string, but when we do a join against an integer it does trim the space. This query should not produce multiple rows.
In my opinion, there should be another check when comparing string to int to make sure the size of the string and integer are the same (so in this example the row with key ‘1234 ‘ will be filtered out). Furthermore, the original query that produces these dupes has MANY more joins. I could fix by CASTing/TRIMing, but it would require me to know all the data types for the columns in the tables involved in the join (and maybe casting a string to int will lose some rows and you can’t TRIM an INT).
Thoughts?
Ryan Kleck
Data Engineer IV
Advanced Analytics
480-505-8800 xt. 4024
This email message and any attachments hereto are intended for use only by its intended recipient(s) and may contain confidential information. If you have received this email in error, please immediately notify the sender and permanently delete the original and any copy of this message and its attachments.