You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Philo Wang <ph...@gmail.com> on 2013/10/02 02:37:59 UTC

Dealing with duplicate rows in Hive

Hi,

I am using Hive 8.1.8 in EMR.

We have an extremely large table (~50 columns) where the uniqueness key is
a combination of 9 different columns. I want to filter out any duplicate
rows based on these 9 columns while retaining the ability to select other
columns on an ad hoc basis. I don’t expect rows with the same uniqueness
key to have different data, so I guess this can be generalized to just
filtering out duplicate rows.

My initial instinct was to do a “select distinct *” on the table and save
the results into another table, but it appears that Hive does not support
“distinct *”. Furthermore, Hive will apply distinct to every column in the
select statement, so something like “select distinct(a), b” does not work
either.

The only option I could think of from here was to explicitly state all
columns of the table inside the distinct statement, but this seems
unnecessarily messy (again, the table contains more than 50 columns).

Has anyone ran into a similar issue? Any insight would be appreciated.

Thanks,
Philo

Re: Dealing with duplicate rows in Hive

Posted by Nitin Pawar <ni...@gmail.com>.

Yes doing group by or distinct on 50 columns is ugly.


One option (ugly as well) is first select only these 9 columns and then do
a select * with join of the first .

something like
(select distinct cols from table) a join (select * from table b) on (a.col
= b.col)
I am really not sure this works but looks like a hack to make it work.

I will try to this in sometime and see if it works.


On Wed, Oct 2, 2013 at 1:28 PM, Philo Wang <ph...@gmail.com> wrote:

> Thanks for the suggestion! Unfortunately, if you use group by in a query
> all columns in the select statement must also appear in the group by. I can
> always select distinct on all 50 columns (or group by all 50 columns), but
> that sounds very extreme and I feel that there has to be a better solution
> out there. Something like a "select distinct *" seems like it would work
> here but according to https://issues.apache.org/jira/browse/HIVE-3199 that
> is not yet supported in Hive.
>
>
> On Wed, Oct 2, 2013 at 12:13 AM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> may be you want to try group by
>>
>> in hive select distinct col1, col2, col3 works but if you want to select
>> all 50 columns its tricky.
>>
>> Other option would be group by all those 9 keys and it should take care
>> that you have combination of those 9 columns is unique.
>>
>>
>> On Wed, Oct 2, 2013 at 12:34 PM, Philo Wang <ph...@gmail.com> wrote:
>>
>>> Yes, that is correct.
>>>
>>>
>>> On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> So you have 50 columns and out of them you want to use 9 columns for
>>>> finding unique rows?
>>>>
>>>> am i correct in assuming that you want to make a key of combination of
>>>> these 9 columns so that you have just one row for a single combination of
>>>> these 9 columns ?
>>>>
>>>>
>>>> On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <ph...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using Hive 8.1.8 in EMR.
>>>>>
>>>>> We have an extremely large table (~50 columns) where the uniqueness
>>>>> key is a combination of 9 different columns. I want to filter out any
>>>>> duplicate rows based on these 9 columns while retaining the ability to
>>>>> select other columns on an ad hoc basis. I don’t expect rows with the same
>>>>> uniqueness key to have different data, so I guess this can be generalized
>>>>> to just filtering out duplicate rows.
>>>>>
>>>>> My initial instinct was to do a “select distinct *” on the table and
>>>>> save the results into another table, but it appears that Hive does not
>>>>> support “distinct *”. Furthermore, Hive will apply distinct to every column
>>>>> in the select statement, so something like “select distinct(a), b” does not
>>>>> work either.
>>>>>
>>>>> The only option I could think of from here was to explicitly state all
>>>>> columns of the table inside the distinct statement, but this seems
>>>>> unnecessarily messy (again, the table contains more than 50 columns).
>>>>>
>>>>> Has anyone ran into a similar issue? Any insight would be appreciated.
>>>>>
>>>>> Thanks,
>>>>> Philo
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


-- 
Nitin Pawar

Re: Dealing with duplicate rows in Hive

Posted by Philo Wang <ph...@gmail.com>.

Thanks for the suggestion! Unfortunately, if you use group by in a query
all columns in the select statement must also appear in the group by. I can
always select distinct on all 50 columns (or group by all 50 columns), but
that sounds very extreme and I feel that there has to be a better solution
out there. Something like a "select distinct *" seems like it would work
here but according to https://issues.apache.org/jira/browse/HIVE-3199 that
is not yet supported in Hive.


On Wed, Oct 2, 2013 at 12:13 AM, Nitin Pawar <ni...@gmail.com>wrote:

> may be you want to try group by
>
> in hive select distinct col1, col2, col3 works but if you want to select
> all 50 columns its tricky.
>
> Other option would be group by all those 9 keys and it should take care
> that you have combination of those 9 columns is unique.
>
>
> On Wed, Oct 2, 2013 at 12:34 PM, Philo Wang <ph...@gmail.com> wrote:
>
>> Yes, that is correct.
>>
>>
>> On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> So you have 50 columns and out of them you want to use 9 columns for
>>> finding unique rows?
>>>
>>> am i correct in assuming that you want to make a key of combination of
>>> these 9 columns so that you have just one row for a single combination of
>>> these 9 columns ?
>>>
>>>
>>> On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <ph...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using Hive 8.1.8 in EMR.
>>>>
>>>> We have an extremely large table (~50 columns) where the uniqueness key
>>>> is a combination of 9 different columns. I want to filter out any duplicate
>>>> rows based on these 9 columns while retaining the ability to select other
>>>> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
>>>> key to have different data, so I guess this can be generalized to just
>>>> filtering out duplicate rows.
>>>>
>>>> My initial instinct was to do a “select distinct *” on the table and
>>>> save the results into another table, but it appears that Hive does not
>>>> support “distinct *”. Furthermore, Hive will apply distinct to every column
>>>> in the select statement, so something like “select distinct(a), b” does not
>>>> work either.
>>>>
>>>> The only option I could think of from here was to explicitly state all
>>>> columns of the table inside the distinct statement, but this seems
>>>> unnecessarily messy (again, the table contains more than 50 columns).
>>>>
>>>> Has anyone ran into a similar issue? Any insight would be appreciated.
>>>>
>>>> Thanks,
>>>> Philo
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Nitin Pawar
>

Re: Dealing with duplicate rows in Hive

Posted by Nitin Pawar <ni...@gmail.com>.

may be you want to try group by

in hive select distinct col1, col2, col3 works but if you want to select
all 50 columns its tricky.

Other option would be group by all those 9 keys and it should take care
that you have combination of those 9 columns is unique.


On Wed, Oct 2, 2013 at 12:34 PM, Philo Wang <ph...@gmail.com> wrote:

> Yes, that is correct.
>
>
> On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> So you have 50 columns and out of them you want to use 9 columns for
>> finding unique rows?
>>
>> am i correct in assuming that you want to make a key of combination of
>> these 9 columns so that you have just one row for a single combination of
>> these 9 columns ?
>>
>>
>> On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <ph...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using Hive 8.1.8 in EMR.
>>>
>>> We have an extremely large table (~50 columns) where the uniqueness key
>>> is a combination of 9 different columns. I want to filter out any duplicate
>>> rows based on these 9 columns while retaining the ability to select other
>>> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
>>> key to have different data, so I guess this can be generalized to just
>>> filtering out duplicate rows.
>>>
>>> My initial instinct was to do a “select distinct *” on the table and
>>> save the results into another table, but it appears that Hive does not
>>> support “distinct *”. Furthermore, Hive will apply distinct to every column
>>> in the select statement, so something like “select distinct(a), b” does not
>>> work either.
>>>
>>> The only option I could think of from here was to explicitly state all
>>> columns of the table inside the distinct statement, but this seems
>>> unnecessarily messy (again, the table contains more than 50 columns).
>>>
>>> Has anyone ran into a similar issue? Any insight would be appreciated.
>>>
>>> Thanks,
>>> Philo
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


-- 
Nitin Pawar

Re: Dealing with duplicate rows in Hive

Posted by Philo Wang <ph...@gmail.com>.

Yes, that is correct.


On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar <ni...@gmail.com>wrote:

> So you have 50 columns and out of them you want to use 9 columns for
> finding unique rows?
>
> am i correct in assuming that you want to make a key of combination of
> these 9 columns so that you have just one row for a single combination of
> these 9 columns ?
>
>
> On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <ph...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using Hive 8.1.8 in EMR.
>>
>> We have an extremely large table (~50 columns) where the uniqueness key
>> is a combination of 9 different columns. I want to filter out any duplicate
>> rows based on these 9 columns while retaining the ability to select other
>> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
>> key to have different data, so I guess this can be generalized to just
>> filtering out duplicate rows.
>>
>> My initial instinct was to do a “select distinct *” on the table and save
>> the results into another table, but it appears that Hive does not support
>> “distinct *”. Furthermore, Hive will apply distinct to every column in the
>> select statement, so something like “select distinct(a), b” does not work
>> either.
>>
>> The only option I could think of from here was to explicitly state all
>> columns of the table inside the distinct statement, but this seems
>> unnecessarily messy (again, the table contains more than 50 columns).
>>
>> Has anyone ran into a similar issue? Any insight would be appreciated.
>>
>> Thanks,
>> Philo
>>
>>
>
>
> --
> Nitin Pawar
>

Re: Dealing with duplicate rows in Hive

Posted by Nitin Pawar <ni...@gmail.com>.

So you have 50 columns and out of them you want to use 9 columns for
finding unique rows?

am i correct in assuming that you want to make a key of combination of
these 9 columns so that you have just one row for a single combination of
these 9 columns ?


On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <ph...@gmail.com> wrote:

> Hi,
>
> I am using Hive 8.1.8 in EMR.
>
> We have an extremely large table (~50 columns) where the uniqueness key is
> a combination of 9 different columns. I want to filter out any duplicate
> rows based on these 9 columns while retaining the ability to select other
> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
> key to have different data, so I guess this can be generalized to just
> filtering out duplicate rows.
>
> My initial instinct was to do a “select distinct *” on the table and save
> the results into another table, but it appears that Hive does not support
> “distinct *”. Furthermore, Hive will apply distinct to every column in the
> select statement, so something like “select distinct(a), b” does not work
> either.
>
> The only option I could think of from here was to explicitly state all
> columns of the table inside the distinct statement, but this seems
> unnecessarily messy (again, the table contains more than 50 columns).
>
> Has anyone ran into a similar issue? Any insight would be appreciated.
>
> Thanks,
> Philo
>
>


-- 
Nitin Pawar