You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by "Ryan C. Kleck" <rk...@godaddy.com> on 2018/02/02 21:59:51 UTC

Joining Integer To String Can Produce Duplicates

Hey all,  I’m on Hive 1.2.2 at work and I found some unfavorable behavior on one of my joins and I wanted to see what you all think.  Below is an example:
https://github.secureserver.net/gist/rkleck/258a9a7b3dd3c915f94a53234e422a1a

WITH string_key_setup AS (
                SELECT                   CAST('1234  ' AS STRING)                   AS my_key
                UNION ALL
                SELECT                   CAST('1234' AS STRING)                     AS my_key
)
, group_setup AS (
                SELECT                   my_key                                  AS my_key
                FROM                     string_key_setup
                GROUP BY             my_key
)
, string_key AS (
                SELECT                   CAST('1234' AS STRING)     AS my_key
)
, integer_key AS (
                SELECT                   CAST('1234' AS INT)                            AS my_key
)
SELECT                                   'String To String Join'                          AS join_type
                                                                , COUNT(1)                            AS num_rows
FROM                                     string_key t1
JOIN                                        group_setup t2
ON                                                          t1.my_key = t2.my_key
UNION ALL
SELECT                                   'Integer To String Join'                        AS join_type
                                                                , COUNT(1)                            AS num_rows
FROM                                     integer_key t1
JOIN                                        group_setup t2
ON                                                          t1.my_key = t2.my_key
;


This query returns the following:
join_type

num_rows

Integer To String Join

2

String To String Join

1



I feel it’s unfavorable because the GROUP BY is not TRIMming the extra space around the string, but when we do a join against an integer it does trim the space.  This query should not produce multiple rows.

In my opinion, there should be another check when comparing string to int to make sure the size of the string and integer are the same (so in this example the row with key ‘1234  ‘ will be filtered out).  Furthermore, the original query that produces these dupes has MANY more joins.  I could fix by CASTing/TRIMing, but it would require me to know all the data types for the columns in the tables involved in the join (and maybe casting a string to int will lose some rows and you can’t TRIM an INT).

Thoughts?



Ryan Kleck
Data Engineer IV
Advanced Analytics
480-505-8800 xt. 4024

This email message and any attachments hereto are intended for use only by its intended recipient(s) and may contain confidential information. If you have received this email in error, please immediately notify the sender and permanently delete the original and any copy of this message and its attachments.