You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Himanish Kushary <hi...@gmail.com> on 2012/08/17 16:49:45 UTC

Identifying and Marking records as duplicates

Hi,

We have a huge table which may have duplicate records.A record is
considered duplicate based on 4 fields ( fld1 thru fld4) . We need to
identify the duplicate records and possibly mark the duplicates(except the
first record based on created time for a record).

Is this something that could be done by hive or we need to write custom M/R
for this.Could a inner join or a select with group by be used to find the
duplicates ? How do I mark the duplicate records as there is no update.

Whats the best way to do this using Hive ? Looking forward to hear the
suggestions.

Thanks

Re: Identifying and Marking records as duplicates

Posted by Himanish Kushary <hi...@gmail.com>.

Thanks everyone for the different possible solutions.We resolved this by
using the rank function and following the instructions at
https://github.com/edwardcapriolo/hive-rank

This is similar to Bob's approach.

Thanks

On Mon, Aug 20, 2012 at 3:04 AM, Navis류승우 <na...@nexr.com> wrote:

> We've solved similar cases by using UDAF + UDTF, something like..
>
> select assign(unique(a, b, c, d)) as (a, b, c, d), e, f from A group by a,
> b, c, d;
>
> unique(UDAF) returns struct and assign(UDTF) reassigns names for the
> struct, where you can append a column for marking.
>
> HIVE-2608 may make this shorter.
>
>
> 2012/8/18 Himanish Kushary <hi...@gmail.com>
>
>> We want to separate them out into different tables for auditing
>> purposes, so i was thinking of marking them and then
>> separating them out based on the "duplicate" flag.
>>
>> Is there a better way of splitting them into 2 tables ? Also how would I
>> pick up the first from the group and put into a different table
>> once I have run the cross join on fld1 thru fld4 with order by timestamp?
>>
>> - Himanish
>>
>> On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <de...@gmail.com>wrote:
>>
>>> If I understand correctly :
>>> A cross join on fld1 thru fld4 with order by timestamp and take the
>>> first of each group.
>>> would allow you to have all the unique and original copies.
>>>
>>> But indeed if you don't have something like HBase, you can not mark them
>>> really but you can select them and write them elsewhere.
>>> Why do you specifically want to mark them?
>>>
>>> Bertrand
>>>
>>>
>>> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bo...@zyquest.com>wrote:
>>>
>>>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>>>> some of our queries. You could do that, and then do another select where
>>>> the row number/rank is 1 to get all the "unique" rows.
>>>>
>>>> There are probably a bunch of other ways to do this, but this is the
>>>> one that first came to mind for me….
>>>>
>>>> Enjoy!
>>>> Bob
>>>>
>>>> Robert Gause
>>>> Senior Systems Engineer
>>>> ZyQuest, Inc.
>>>> bob.gause@zyquest.com
>>>>
>>>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > We have a huge table which may have duplicate records.A record is
>>>> considered duplicate based on 4 fields ( fld1 thru fld4) . We need to
>>>> identify the duplicate records and possibly mark the duplicates(except the
>>>> first record based on created time for a record).
>>>> >
>>>> > Is this something that could be done by hive or we need to write
>>>> custom M/R for this.Could a inner join or a select with group by be used to
>>>> find the duplicates ? How do I mark the duplicate records as there is no
>>>> update.
>>>> >
>>>> > Whats the best way to do this using Hive ? Looking forward to hear
>>>> the suggestions.
>>>> >
>>>> > Thanks
>>>>
>>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>>
>>
>>
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish

Re: Identifying and Marking records as duplicates

Posted by Navis류승우 <na...@nexr.com>.

We've solved similar cases by using UDAF + UDTF, something like..

select assign(unique(a, b, c, d)) as (a, b, c, d), e, f from A group by a,
b, c, d;

unique(UDAF) returns struct and assign(UDTF) reassigns names for the
struct, where you can append a column for marking.

HIVE-2608 may make this shorter.

2012/8/18 Himanish Kushary <hi...@gmail.com>

> We want to separate them out into different tables for auditing purposes,
> so i was thinking of marking them and then
> separating them out based on the "duplicate" flag.
>
> Is there a better way of splitting them into 2 tables ? Also how would I
> pick up the first from the group and put into a different table
> once I have run the cross join on fld1 thru fld4 with order by timestamp?
>
> - Himanish
>
> On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <de...@gmail.com>wrote:
>
>> If I understand correctly :
>> A cross join on fld1 thru fld4 with order by timestamp and take the first
>> of each group.
>> would allow you to have all the unique and original copies.
>>
>> But indeed if you don't have something like HBase, you can not mark them
>> really but you can select them and write them elsewhere.
>> Why do you specifically want to mark them?
>>
>> Bertrand
>>
>>
>> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bo...@zyquest.com> wrote:
>>
>>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>>> some of our queries. You could do that, and then do another select where
>>> the row number/rank is 1 to get all the "unique" rows.
>>>
>>> There are probably a bunch of other ways to do this, but this is the one
>>> that first came to mind for me….
>>>
>>> Enjoy!
>>> Bob
>>>
>>> Robert Gause
>>> Senior Systems Engineer
>>> ZyQuest, Inc.
>>> bob.gause@zyquest.com
>>>
>>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>>
>>> > Hi,
>>> >
>>> > We have a huge table which may have duplicate records.A record is
>>> considered duplicate based on 4 fields ( fld1 thru fld4) . We need to
>>> identify the duplicate records and possibly mark the duplicates(except the
>>> first record based on created time for a record).
>>> >
>>> > Is this something that could be done by hive or we need to write
>>> custom M/R for this.Could a inner join or a select with group by be used to
>>> find the duplicates ? How do I mark the duplicate records as there is no
>>> update.
>>> >
>>> > Whats the best way to do this using Hive ? Looking forward to hear the
>>> suggestions.
>>> >
>>> > Thanks
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>
>
> --
> Thanks & Regards
> Himanish
>

Re: Identifying and Marking records as duplicates

Posted by Himanish Kushary <hi...@gmail.com>.

We want to separate them out into different tables for auditing purposes,
so i was thinking of marking them and then
separating them out based on the "duplicate" flag.

Is there a better way of splitting them into 2 tables ? Also how would I
pick up the first from the group and put into a different table
once I have run the cross join on fld1 thru fld4 with order by timestamp?

- Himanish

On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <de...@gmail.com>wrote:

> If I understand correctly :
> A cross join on fld1 thru fld4 with order by timestamp and take the first
> of each group.
> would allow you to have all the unique and original copies.
>
> But indeed if you don't have something like HBase, you can not mark them
> really but you can select them and write them elsewhere.
> Why do you specifically want to mark them?
>
> Bertrand
>
>
> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bo...@zyquest.com> wrote:
>
>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>> some of our queries. You could do that, and then do another select where
>> the row number/rank is 1 to get all the "unique" rows.
>>
>> There are probably a bunch of other ways to do this, but this is the one
>> that first came to mind for me….
>>
>> Enjoy!
>> Bob
>>
>> Robert Gause
>> Senior Systems Engineer
>> ZyQuest, Inc.
>> bob.gause@zyquest.com
>>
>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>
>> > Hi,
>> >
>> > We have a huge table which may have duplicate records.A record is
>> considered duplicate based on 4 fields ( fld1 thru fld4) . We need to
>> identify the duplicate records and possibly mark the duplicates(except the
>> first record based on created time for a record).
>> >
>> > Is this something that could be done by hive or we need to write custom
>> M/R for this.Could a inner join or a select with group by be used to find
>> the duplicates ? How do I mark the duplicate records as there is no update.
>> >
>> > Whats the best way to do this using Hive ? Looking forward to hear the
>> suggestions.
>> >
>> > Thanks
>>
>>
>
>
> --
> Bertrand Dechoux
>



-- 
Thanks & Regards
Himanish

Re: Identifying and Marking records as duplicates

Posted by Bertrand Dechoux <de...@gmail.com>.

If I understand correctly :
A cross join on fld1 thru fld4 with order by timestamp and take the first
of each group.
would allow you to have all the unique and original copies.

But indeed if you don't have something like HBase, you can not mark them
really but you can select them and write them elsewhere.
Why do you specifically want to mark them?

Bertrand

On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bo...@zyquest.com> wrote:

> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in some
> of our queries. You could do that, and then do another select where the row
> number/rank is 1 to get all the "unique" rows.
>
> There are probably a bunch of other ways to do this, but this is the one
> that first came to mind for me….
>
> Enjoy!
> Bob
>
> Robert Gause
> Senior Systems Engineer
> ZyQuest, Inc.
> bob.gause@zyquest.com
>
> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>
> > Hi,
> >
> > We have a huge table which may have duplicate records.A record is
> considered duplicate based on 4 fields ( fld1 thru fld4) . We need to
> identify the duplicate records and possibly mark the duplicates(except the
> first record based on created time for a record).
> >
> > Is this something that could be done by hive or we need to write custom
> M/R for this.Could a inner join or a select with group by be used to find
> the duplicates ? How do I mark the duplicate records as there is no update.
> >
> > Whats the best way to do this using Hive ? Looking forward to hear the
> suggestions.
> >
> > Thanks
>
>


-- 
Bertrand Dechoux

Re: Identifying and Marking records as duplicates

Posted by Bob Gause <bo...@zyquest.com>.

We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in some of our queries. You could do that, and then do another select where the row number/rank is 1 to get all the "unique" rows.

There are probably a bunch of other ways to do this, but this is the one that first came to mind for me….

Enjoy!
Bob

Robert Gause
Senior Systems Engineer
ZyQuest, Inc.
bob.gause@zyquest.com

On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:

> Hi,
> 
> We have a huge table which may have duplicate records.A record is considered duplicate based on 4 fields ( fld1 thru fld4) . We need to identify the duplicate records and possibly mark the duplicates(except the first record based on created time for a record).
> 
> Is this something that could be done by hive or we need to write custom M/R for this.Could a inner join or a select with group by be used to find the duplicates ? How do I mark the duplicate records as there is no update.
> 
> Whats the best way to do this using Hive ? Looking forward to hear the suggestions.
> 
> Thanks