You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Sunita Arvind <su...@gmail.com> on 2013/07/30 16:59:52 UTC

Hive Join with distinct rows

Hi Praveen / All,

I also have a requirement similar to the one explained (by Praveen) below:
distinct rows on a single column with corresponding data from other columns.

http://mail-archives.apache.org/mod_mbox/hive-user/201211.mbox/%3CCAHmB8ta+r0H5A+aRMutOoKHKP8FXCToWN68QoZ6LkfMwBrKGTA@mail.gmail.com%3E

This email thread dates back to Nov 2012 and is a very common use case.I
just wanted to check if there is a solution already or we still need to
write a UDF.

regards
Sunita

Re: Hive Join with distinct rows

Posted by Sunita Arvind <su...@gmail.com>.
Thanks for sharing your experience Marcin

Sunita


On Tue, Jul 30, 2013 at 11:54 AM, Marcin Mejran <marcin.mejran@hooklogic.com
> wrote:

>  I’ve used a rank udf for this previously, distribute and sort by the
> column then select all rows where rank=1. That should work with a join but
> I never tried it. It’d be an issue if the join outputs a lot of records
> that then are all dropped since that’d slow down the query.****
>
> ** **
>
> I’ve actually forked Hive internally and added a distinct join based on
> the, now deprecated I guess, unique join code. It’s ugly in terms of syntax
> and I haven’t had a chance to open source it but it allows a good amount of
> control over what is joined to what (ie: select the row in table A whose
> column x is closets to column y in table B, for example request time). I
> really wish Hive had better support for such “non-SQL” types of queries
> which are common in a world of unstructured and un-clean data.****
>
> ** **
>
> -Marcin****
>
> ** **
>
> *From:* Sunita Arvind [mailto:sunitarvind@gmail.com]
> *Sent:* Tuesday, July 30, 2013 11:00 AM
> *To:* user@hive.apache.org
> *Subject:* Hive Join with distinct rows****
>
> ** **
>
> Hi Praveen / All,
>
> I also have a requirement similar to the one explained (by Praveen) below:
> distinct rows on a single column with corresponding data from other
> columns.
>
>
> http://mail-archives.apache.org/mod_mbox/hive-user/201211.mbox/%3CCAHmB8ta+r0H5A+aRMutOoKHKP8FXCToWN68QoZ6LkfMwBrKGTA@mail.gmail.com%3E
> ****
>
> This email thread dates back to Nov 2012 and is a very common use case.I
> just wanted to check if there is a solution already or we still need to
> write a UDF.****
>
> regards
> Sunita****
>

RE: Hive Join with distinct rows

Posted by Marcin Mejran <ma...@hooklogic.com>.
I've used a rank udf for this previously, distribute and sort by the column then select all rows where rank=1. That should work with a join but I never tried it. It'd be an issue if the join outputs a lot of records that then are all dropped since that'd slow down the query.

I've actually forked Hive internally and added a distinct join based on the, now deprecated I guess, unique join code. It's ugly in terms of syntax and I haven't had a chance to open source it but it allows a good amount of control over what is joined to what (ie: select the row in table A whose column x is closets to column y in table B, for example request time). I really wish Hive had better support for such "non-SQL" types of queries which are common in a world of unstructured and un-clean data.

-Marcin

From: Sunita Arvind [mailto:sunitarvind@gmail.com]
Sent: Tuesday, July 30, 2013 11:00 AM
To: user@hive.apache.org
Subject: Hive Join with distinct rows

Hi Praveen / All,

I also have a requirement similar to the one explained (by Praveen) below:
distinct rows on a single column with corresponding data from other columns.

http://mail-archives.apache.org/mod_mbox/hive-user/201211.mbox/%3CCAHmB8ta+r0H5A+aRMutOoKHKP8FXCToWN68QoZ6LkfMwBrKGTA@mail.gmail.com%3E
This email thread dates back to Nov 2012 and is a very common use case.I just wanted to check if there is a solution already or we still need to write a UDF.
regards
Sunita