You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Raj Hadoop <ha...@yahoo.com> on 2014/09/10 19:04:57 UTC

Remove duplicate records in Hive

Hi,

I have a requirement in Hive to remove duplicate records ( they differ only by one column i.e a date column) and keep the latest date record.

Sample :
Hive Table :
 d2 is a higher 
cno,sqno,date

100 1 1-oct-2013
101 2 1-oct-2013
100 1 2-oct-2013
102 2 2-oct-2013


Output needed:

100 1 2-oct-2013
101 2 1-oct-2013
102 2 2-oct-2013

I am using Hive 0.11

Any suggestions please ?

Regards,
Raj

Re: Remove duplicate records in Hive

Posted by Kevin Weiler <Ke...@imc-chicago.com>.

Whoops, thought this was someone in my office, so obviously you can’t come see me :)

On Sep 10, 2014, at 12:04 PM, Raj Hadoop <ha...@yahoo.com>> wrote:

Hi,

I have a requirement in Hive to remove duplicate records ( they differ only by one column i.e a date column) and keep the latest date record.

Sample :
Hive Table :
d2 is a higher
cno,sqno,date

100 1 1-oct-2013
101 2 1-oct-2013
100 1 2-oct-2013
102 2 2-oct-2013

Output needed:

100 1 2-oct-2013
101 2 1-oct-2013
102 2 2-oct-2013

I am using Hive 0.11

Any suggestions please ?

Regards,
Raj

________________________________

The information in this e-mail is intended only for the person or entity to which it is addressed.

It may contain confidential and /or privileged material. If someone other than the intended recipient should receive this e-mail, he / she shall not be entitled to read, disseminate, disclose or duplicate it.

If you receive this e-mail unintentionally, please inform us immediately by "reply" and then delete it from your system. Although this information has been compiled with great care, neither IMC Financial Markets & Asset Management nor any of its related entities shall accept any responsibility for any errors, omissions or other inaccuracies in this information or for the consequences thereof, nor shall it be bound in any way by the contents of this e-mail or its attachments. In the event of incomplete or incorrect transmission, please return the e-mail to the sender and permanently delete this message and any attachments.

Messages and attachments are scanned for all known viruses. Always scan attachments before opening them.

RES: Remove duplicate records in Hive

Posted by Felipe Naberezny Lopes <fe...@b2wdigital.com>.

¬¬'

-----Mensagem original-----
De: Raj Hadoop [mailto:hadoopraj@yahoo.com]
Enviada em: quarta-feira, 10 de setembro de 2014 15:42
Para: user@hive.apache.org
Assunto: Re: Remove duplicate records in Hive

Thanks. I will try it.
--------------------------------------------
On Wed, 9/10/14, Nishant Kelkar <ni...@gmail.com> wrote:

Subject: Re: Remove duplicate records in Hive
To: user@hive.apache.org, hadoopraj@yahoo.com
Date: Wednesday, September 10, 2014, 1:59 PM

Hi
Raj,
You can do something
along these lines:

SELECT
cno, sqno, SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date FROM table GROUP BY cno, sqno; However, you have to make sure your date format is such that sorting it gives you the most recent date. The best way to do that is to have it in
format: YYYY-MM-DD.
Hope this helps.
Best Regards,Nishant
Kelkar
On Wed, Sep 10, 2014 at
10:04 AM, Raj Hadoop <ha...@yahoo.com>
wrote:

Hi,

I have a requirement in Hive to remove duplicate records ( they differ only by one column i.e a date column) and keep the latest date record.

Sample :

Hive Table :

d2 is a higher

cno,sqno,date

100 1 1-oct-2013

101 2 1-oct-2013

100 1 2-oct-2013

102 2 2-oct-2013

Output needed:

100 1 2-oct-2013

101 2 1-oct-2013

102 2 2-oct-2013

I am using Hive 0.11

Any suggestions please ?

Regards,

Raj

________________________________
Esta mensagem pode conter informações confidenciais e somente o indivíduo ou entidade a quem foi destinada pode utilizá-la. A transmissão incorreta da mensagem não acarreta a perda de sua confidencialidade. Caso esta mensagem tenha sido recebida por engano, solicitamos que o fato seja comunicado ao remetente e que a mensagem seja eliminada de seu sistema imediatamente. É vedado a qualquer pessoa que não seja o destinatário usar, revelar, distribuir ou copiar qualquer parte desta mensagem. Ambiente de comunicação sujeito a monitoramento.

This message may include confidential information and only the intended addresses have the right to use it as is, or any part of it. A wrong transmission does not break its confidentiality. If you've received it because of a mistake or erroneous transmission, please notify the sender and delete it from your system immediately. This communication environment is controlled and monitored.

B2W Digital

Re: Remove duplicate records in Hive

Posted by Raj Hadoop <ha...@yahoo.com>.

Thank you all for your suggestions. This group is the best.

I am working with the different options you guys suggested.

One big question I have is -

I am good at writing Oracle SQL queries. But the syntax with Hive is different. Especially - wiritng multiple SELECT statements in a single Hive Query has become a challenge. Can the group suggest any good tutorial that explains the basics of "Syntax to develop complex queries in Hive".

Regards,
Rajendra





On Thursday, September 11, 2014 2:48 AM, vivek thakre <vi...@gmail.com> wrote:
 


Considering that the records only differ by one column i.e if the first two columns are are unique (distinct), then you simply use group by with max as aggregation function to eliminate duplicates i,e 

select cno, sqno, max (date) 
from table 
group by cno, sqno

If the above assumption is not true i.e if cno and sqno are not unique and for a particular cno, you want to get sqno with latest date, then you can do inner join with max select query something like

select a.cno, a.sqno, a.date
from table a 
join (select cno, max(date)  as max_date from table group by cno) b
on a.cno=b.cno
and a.date = b.max_date



On Wed, Sep 10, 2014 at 3:39 PM, Nishant Kelkar <ni...@gmail.com> wrote:

Try something like this then:
>
>
>SELECT A.cno, A.sqno, A.sorted_dates[A.size-1] AS latest_date
>FROM 
>(
>SELECT cno, sqno,
>SORT_ARRAY(COLLECT_SET(date)) AS sorted_dates, SIZE(COLLECT_SET(date)) AS size 
>FROM table GROUP BY cno, sqno
>) A;
>
>
>
>There are better ways of doing this, but this one's quick and dirty :)
>
>
>Best Regards,
>Nishant Kelkar
>
>
>On Wed, Sep 10, 2014 at 12:48 PM, Raj Hadoop <ha...@yahoo.com> wrote:
>
>sort_array returns in ascending order. so the first element cannot be the largest date. the last element is the largest date.
>>
>>
>>
>>
>>On Wednesday, September 10, 2014 3:38 PM, Nishant Kelkar <ni...@gmail.com> wrote:
>> 
>>
>>
>>Hi Raj,
>>
>>
>>You'll have to change the format of your date to something like YYYY-MM-DD. For example, for "2-oct-2013" it will be 2013-10-02.
>>
>>
>>Best Regards,
>>Nishant Kelkar
>>
>>
>>
>>
>>
>>On Wed, Sep 10, 2014 at 11:48 AM, Raj Hadoop <ha...@yahoo.com> wrote:
>>
>>The
>>>
>>>SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>>>
>>>is returning the lowest date. I need the largest date.
>>>
>>>
>>>
>>>--------------------------------------------
>>>On Wed, 9/10/14, Raj Hadoop <ha...@yahoo.com> wrote:
>>>
>>> Subject: Re: Remove duplicate records in Hive
>>> To: user@hive.apache.org
>>> Date: Wednesday, September 10, 2014, 2:41 PM
>>>
>>>
>>> Thanks. I will try it.
>>> --------------------------------------------
>>> On Wed, 9/10/14, Nishant Kelkar <ni...@gmail.com>
>>> wrote:
>>>
>>>  Subject: Re: Remove
>>> duplicate records in Hive
>>>  To: user@hive.apache.org,
>>> hadoopraj@yahoo.com
>>>  Date: Wednesday, September 10, 2014, 1:59
>>> PM
>>>
>>>  Hi
>>>
>>> Raj, 
>>>  You can do something
>>>  along these lines: 
>>>
>>>  SELECT
>>>  cno, sqno,
>>> SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>>>  FROM table GROUP BY cno, sqno;
>>>  However, you have to make sure your
>>>  date format is such that sorting it gives you
>>> the most
>>>  recent date. The best way to do
>>> that is to have it in
>>>  format:
>>> YYYY-MM-DD.
>>>  Hope this helps.
>>>  Best Regards,Nishant
>>>
>>> Kelkar
>>>  On Wed, Sep 10, 2014 at
>>>  10:04 AM, Raj Hadoop <ha...@yahoo.com>
>>>  wrote:
>>>
>>>
>>>  Hi,
>>>
>>>
>>>
>>>  I have a requirement in Hive
>>> to remove duplicate records (
>>>  they differ
>>> only by one column i.e a date column) and keep
>>>  the latest date record.
>>>
>>>
>>>
>>>  Sample
>>> :
>>>
>>>  Hive Table :
>>>
>>>   d2 is a higher
>>>
>>>  cno,sqno,date
>>>
>>>
>>>
>>>  100 1 1-oct-2013
>>>
>>>  101 2 1-oct-2013
>>>
>>>  100 1 2-oct-2013
>>>
>>>  102 2 2-oct-2013
>>>
>>>
>>>
>>>
>>>
>>>  Output needed:
>>>
>>>
>>>
>>>  100 1 2-oct-2013
>>>
>>>  101 2 1-oct-2013
>>>
>>>  102 2 2-oct-2013
>>>
>>>
>>>
>>>  I am using
>>> Hive 0.11
>>>
>>>
>>>
>>>  Any suggestions please ?
>>>
>>>
>>>
>>>  Regards,
>>>
>>>
>>> Raj
>>>
>>>
>>>
>>>
>>
>>
>>
>

Re: Remove duplicate records in Hive

Posted by vivek thakre <vi...@gmail.com>.

Considering that the records only differ by one column i.e if the first two
columns are are unique (distinct), then you simply use group by with max as
aggregation function to eliminate duplicates i,e

select cno, sqno, max (date)
from table
group by cno, sqno

If the above assumption is not true i.e if cno and sqno are not unique and
for a particular cno, you want to get sqno with latest date, then you can
do inner join with max select query something like

select a.cno, a.sqno, a.date
from table a
join (select cno, max(date)  as max_date from table group by cno) b
on a.cno=b.cno
and a.date = b.max_date


On Wed, Sep 10, 2014 at 3:39 PM, Nishant Kelkar <ni...@gmail.com>
wrote:

> Try something like this then:
>
> SELECT A.cno, A.sqno, A.sorted_dates[A.size-1] AS latest_date
> FROM
> (
> SELECT cno, sqno,
> SORT_ARRAY(COLLECT_SET(date)) AS sorted_dates, SIZE(COLLECT_SET(date)) AS
> size
> FROM table GROUP BY cno, sqno
> ) A;
>
> There are better ways of doing this, but this one's quick and dirty :)
>
> Best Regards,
> Nishant Kelkar
>
> On Wed, Sep 10, 2014 at 12:48 PM, Raj Hadoop <ha...@yahoo.com> wrote:
>
>> sort_array returns in ascending order. so the first element cannot be the
>> largest date. the last element is the largest date.
>>
>>
>>   On Wednesday, September 10, 2014 3:38 PM, Nishant Kelkar <
>> nishant.k02@gmail.com> wrote:
>>
>>
>> Hi Raj,
>>
>> You'll have to change the format of your date to something like
>> YYYY-MM-DD. For example, for "2-oct-2013" it will be 2013-10-02.
>>
>> Best Regards,
>> Nishant Kelkar
>>
>>
>>
>> On Wed, Sep 10, 2014 at 11:48 AM, Raj Hadoop <ha...@yahoo.com> wrote:
>>
>> The
>>
>> SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>>
>> is returning the lowest date. I need the largest date.
>>
>>
>>
>> --------------------------------------------
>> On Wed, 9/10/14, Raj Hadoop <ha...@yahoo.com> wrote:
>>
>>  Subject: Re: Remove duplicate records in Hive
>>  To: user@hive.apache.org
>>  Date: Wednesday, September 10, 2014, 2:41 PM
>>
>>  Thanks. I will try it.
>>  --------------------------------------------
>>  On Wed, 9/10/14, Nishant Kelkar <ni...@gmail.com>
>>  wrote:
>>
>>   Subject: Re: Remove
>>  duplicate records in Hive
>>   To: user@hive.apache.org,
>>  hadoopraj@yahoo.com
>>   Date: Wednesday, September 10, 2014, 1:59
>>  PM
>>
>>   Hi
>>
>>  Raj,
>>   You can do something
>>   along these lines:
>>
>>   SELECT
>>   cno, sqno,
>>  SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>>   FROM table GROUP BY cno, sqno;
>>   However, you have to make sure your
>>   date format is such that sorting it gives you
>>  the most
>>   recent date. The best way to do
>>  that is to have it in
>>   format:
>>  YYYY-MM-DD.
>>   Hope this helps.
>>   Best Regards,Nishant
>>
>>  Kelkar
>>   On Wed, Sep 10, 2014 at
>>   10:04 AM, Raj Hadoop <ha...@yahoo.com>
>>   wrote:
>>
>>
>>   Hi,
>>
>>
>>
>>   I have a requirement in Hive
>>  to remove duplicate records (
>>   they differ
>>  only by one column i.e a date column) and keep
>>   the latest date record.
>>
>>
>>
>>   Sample
>>  :
>>
>>   Hive Table :
>>
>>    d2 is a higher
>>
>>   cno,sqno,date
>>
>>
>>
>>   100 1 1-oct-2013
>>
>>   101 2 1-oct-2013
>>
>>   100 1 2-oct-2013
>>
>>   102 2 2-oct-2013
>>
>>
>>
>>
>>
>>   Output needed:
>>
>>
>>
>>   100 1 2-oct-2013
>>
>>   101 2 1-oct-2013
>>
>>   102 2 2-oct-2013
>>
>>
>>
>>   I am using
>>  Hive 0.11
>>
>>
>>
>>   Any suggestions please ?
>>
>>
>>
>>   Regards,
>>
>>
>>  Raj
>>
>>
>>
>>
>>
>>
>>
>

Re: Remove duplicate records in Hive

Posted by Nishant Kelkar <ni...@gmail.com>.

Try something like this then:

SELECT A.cno, A.sqno, A.sorted_dates[A.size-1] AS latest_date
FROM
(
SELECT cno, sqno,
SORT_ARRAY(COLLECT_SET(date)) AS sorted_dates, SIZE(COLLECT_SET(date)) AS
size
FROM table GROUP BY cno, sqno
) A;

There are better ways of doing this, but this one's quick and dirty :)

Best Regards,
Nishant Kelkar

On Wed, Sep 10, 2014 at 12:48 PM, Raj Hadoop <ha...@yahoo.com> wrote:

> sort_array returns in ascending order. so the first element cannot be the
> largest date. the last element is the largest date.
>
>
>   On Wednesday, September 10, 2014 3:38 PM, Nishant Kelkar <
> nishant.k02@gmail.com> wrote:
>
>
> Hi Raj,
>
> You'll have to change the format of your date to something like
> YYYY-MM-DD. For example, for "2-oct-2013" it will be 2013-10-02.
>
> Best Regards,
> Nishant Kelkar
>
>
>
> On Wed, Sep 10, 2014 at 11:48 AM, Raj Hadoop <ha...@yahoo.com> wrote:
>
> The
>
> SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>
> is returning the lowest date. I need the largest date.
>
>
>
> --------------------------------------------
> On Wed, 9/10/14, Raj Hadoop <ha...@yahoo.com> wrote:
>
>  Subject: Re: Remove duplicate records in Hive
>  To: user@hive.apache.org
>  Date: Wednesday, September 10, 2014, 2:41 PM
>
>  Thanks. I will try it.
>  --------------------------------------------
>  On Wed, 9/10/14, Nishant Kelkar <ni...@gmail.com>
>  wrote:
>
>   Subject: Re: Remove
>  duplicate records in Hive
>   To: user@hive.apache.org,
>  hadoopraj@yahoo.com
>   Date: Wednesday, September 10, 2014, 1:59
>  PM
>
>   Hi
>
>  Raj,
>   You can do something
>   along these lines:
>
>   SELECT
>   cno, sqno,
>  SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>   FROM table GROUP BY cno, sqno;
>   However, you have to make sure your
>   date format is such that sorting it gives you
>  the most
>   recent date. The best way to do
>  that is to have it in
>   format:
>  YYYY-MM-DD.
>   Hope this helps.
>   Best Regards,Nishant
>
>  Kelkar
>   On Wed, Sep 10, 2014 at
>   10:04 AM, Raj Hadoop <ha...@yahoo.com>
>   wrote:
>
>
>   Hi,
>
>
>
>   I have a requirement in Hive
>  to remove duplicate records (
>   they differ
>  only by one column i.e a date column) and keep
>   the latest date record.
>
>
>
>   Sample
>  :
>
>   Hive Table :
>
>    d2 is a higher
>
>   cno,sqno,date
>
>
>
>   100 1 1-oct-2013
>
>   101 2 1-oct-2013
>
>   100 1 2-oct-2013
>
>   102 2 2-oct-2013
>
>
>
>
>
>   Output needed:
>
>
>
>   100 1 2-oct-2013
>
>   101 2 1-oct-2013
>
>   102 2 2-oct-2013
>
>
>
>   I am using
>  Hive 0.11
>
>
>
>   Any suggestions please ?
>
>
>
>   Regards,
>
>
>  Raj
>
>
>
>
>
>
>

Re: Remove duplicate records in Hive

Posted by Raj Hadoop <ha...@yahoo.com>.

sort_array returns in ascending order. so the first element cannot be the largest date. the last element is the largest date.



On Wednesday, September 10, 2014 3:38 PM, Nishant Kelkar <ni...@gmail.com> wrote:
 


Hi Raj,

You'll have to change the format of your date to something like YYYY-MM-DD. For example, for "2-oct-2013" it will be 2013-10-02.

Best Regards,
Nishant Kelkar





On Wed, Sep 10, 2014 at 11:48 AM, Raj Hadoop <ha...@yahoo.com> wrote:

The
>
>SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>
>is returning the lowest date. I need the largest date.
>
>
>
>--------------------------------------------
>On Wed, 9/10/14, Raj Hadoop <ha...@yahoo.com> wrote:
>
> Subject: Re: Remove duplicate records in Hive
> To: user@hive.apache.org
> Date: Wednesday, September 10, 2014, 2:41 PM
>
>
> Thanks. I will try it.
> --------------------------------------------
> On Wed, 9/10/14, Nishant Kelkar <ni...@gmail.com>
> wrote:
>
>  Subject: Re: Remove
> duplicate records in Hive
>  To: user@hive.apache.org,
> hadoopraj@yahoo.com
>  Date: Wednesday, September 10, 2014, 1:59
> PM
>
>  Hi
>
> Raj, 
>  You can do something
>  along these lines: 
>
>  SELECT
>  cno, sqno,
> SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>  FROM table GROUP BY cno, sqno;
>  However, you have to make sure your
>  date format is such that sorting it gives you
> the most
>  recent date. The best way to do
> that is to have it in
>  format:
> YYYY-MM-DD.
>  Hope this helps.
>  Best Regards,Nishant
>
> Kelkar
>  On Wed, Sep 10, 2014 at
>  10:04 AM, Raj Hadoop <ha...@yahoo.com>
>  wrote:
>
>
>  Hi,
>
>
>
>  I have a requirement in Hive
> to remove duplicate records (
>  they differ
> only by one column i.e a date column) and keep
>  the latest date record.
>
>
>
>  Sample
> :
>
>  Hive Table :
>
>   d2 is a higher
>
>  cno,sqno,date
>
>
>
>  100 1 1-oct-2013
>
>  101 2 1-oct-2013
>
>  100 1 2-oct-2013
>
>  102 2 2-oct-2013
>
>
>
>
>
>  Output needed:
>
>
>
>  100 1 2-oct-2013
>
>  101 2 1-oct-2013
>
>  102 2 2-oct-2013
>
>
>
>  I am using
> Hive 0.11
>
>
>
>  Any suggestions please ?
>
>
>
>  Regards,
>
>
> Raj
>
>
>
>

Re: Remove duplicate records in Hive

Posted by Nishant Kelkar <ni...@gmail.com>.

Hi Raj,

You'll have to change the format of your date to something like YYYY-MM-DD.
For example, for "2-oct-2013" it will be 2013-10-02.

Best Regards,
Nishant Kelkar



On Wed, Sep 10, 2014 at 11:48 AM, Raj Hadoop <ha...@yahoo.com> wrote:

> The
>
> SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>
> is returning the lowest date. I need the largest date.
>
>
>
> --------------------------------------------
> On Wed, 9/10/14, Raj Hadoop <ha...@yahoo.com> wrote:
>
>  Subject: Re: Remove duplicate records in Hive
>  To: user@hive.apache.org
>  Date: Wednesday, September 10, 2014, 2:41 PM
>
>  Thanks. I will try it.
>  --------------------------------------------
>  On Wed, 9/10/14, Nishant Kelkar <ni...@gmail.com>
>  wrote:
>
>   Subject: Re: Remove
>  duplicate records in Hive
>   To: user@hive.apache.org,
>  hadoopraj@yahoo.com
>   Date: Wednesday, September 10, 2014, 1:59
>  PM
>
>   Hi
>
>  Raj,
>   You can do something
>   along these lines:
>
>   SELECT
>   cno, sqno,
>  SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>   FROM table GROUP BY cno, sqno;
>   However, you have to make sure your
>   date format is such that sorting it gives you
>  the most
>   recent date. The best way to do
>  that is to have it in
>   format:
>  YYYY-MM-DD.
>   Hope this helps.
>   Best Regards,Nishant
>
>  Kelkar
>   On Wed, Sep 10, 2014 at
>   10:04 AM, Raj Hadoop <ha...@yahoo.com>
>   wrote:
>
>
>   Hi,
>
>
>
>   I have a requirement in Hive
>  to remove duplicate records (
>   they differ
>  only by one column i.e a date column) and keep
>   the latest date record.
>
>
>
>   Sample
>  :
>
>   Hive Table :
>
>    d2 is a higher
>
>   cno,sqno,date
>
>
>
>   100 1 1-oct-2013
>
>   101 2 1-oct-2013
>
>   100 1 2-oct-2013
>
>   102 2 2-oct-2013
>
>
>
>
>
>   Output needed:
>
>
>
>   100 1 2-oct-2013
>
>   101 2 1-oct-2013
>
>   102 2 2-oct-2013
>
>
>
>   I am using
>  Hive 0.11
>
>
>
>   Any suggestions please ?
>
>
>
>   Regards,
>
>
>  Raj
>
>
>
>

Re: Remove duplicate records in Hive

Posted by Raj Hadoop <ha...@yahoo.com>.

The

SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date

is returning the lowest date. I need the largest date.



--------------------------------------------
On Wed, 9/10/14, Raj Hadoop <ha...@yahoo.com> wrote:

 Subject: Re: Remove duplicate records in Hive
 To: user@hive.apache.org
 Date: Wednesday, September 10, 2014, 2:41 PM
 
 Thanks. I will try it.
 --------------------------------------------
 On Wed, 9/10/14, Nishant Kelkar <ni...@gmail.com>
 wrote:
 
  Subject: Re: Remove
 duplicate records in Hive
  To: user@hive.apache.org,
 hadoopraj@yahoo.com
  Date: Wednesday, September 10, 2014, 1:59
 PM
  
  Hi
 
 Raj, 
  You can do something
  along these lines: 
  
  SELECT
  cno, sqno,
 SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
  FROM table GROUP BY cno, sqno;
  However, you have to make sure your
  date format is such that sorting it gives you
 the most
  recent date. The best way to do
 that is to have it in
  format:
 YYYY-MM-DD.
  Hope this helps.
  Best Regards,Nishant
 
 Kelkar
  On Wed, Sep 10, 2014 at
  10:04 AM, Raj Hadoop <ha...@yahoo.com>
  wrote:
  
  
  Hi,
  
  
  
  I have a requirement in Hive
 to remove duplicate records (
  they differ
 only by one column i.e a date column) and keep
  the latest date record.
  
  
  
  Sample
 :
  
  Hive Table :
  
   d2 is a higher
  
  cno,sqno,date
  
  
  
  100 1 1-oct-2013
  
  101 2 1-oct-2013
  
  100 1 2-oct-2013
  
  102 2 2-oct-2013
  
  
  
  
  
  Output needed:
  
  
  
  100 1 2-oct-2013
  
  101 2 1-oct-2013
  
  102 2 2-oct-2013
  
  
  
  I am using
 Hive 0.11
  
  
  
  Any suggestions please ?
  
  
  
  Regards,
  
 
 Raj

Re: Remove duplicate records in Hive

Posted by Raj Hadoop <ha...@yahoo.com>.

Thanks. I will try it.
--------------------------------------------
On Wed, 9/10/14, Nishant Kelkar <ni...@gmail.com> wrote:

 Subject: Re: Remove duplicate records in Hive
 To: user@hive.apache.org, hadoopraj@yahoo.com
 Date: Wednesday, September 10, 2014, 1:59 PM

 Hi
 Raj, 
 You can do something
 along these lines: 

 SELECT
 cno, sqno, SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
 FROM table GROUP BY cno, sqno;
 However, you have to make sure your
 date format is such that sorting it gives you the most
 recent date. The best way to do that is to have it in
 format: YYYY-MM-DD.
 Hope this helps.
 Best Regards,Nishant
 Kelkar
 On Wed, Sep 10, 2014 at
 10:04 AM, Raj Hadoop <ha...@yahoo.com>
 wrote:

 Hi,

 I have a requirement in Hive to remove duplicate records (
 they differ only by one column i.e a date column) and keep
 the latest date record.

 Sample :

 Hive Table :

  d2 is a higher

 cno,sqno,date

 100 1 1-oct-2013

 101 2 1-oct-2013

 100 1 2-oct-2013

 102 2 2-oct-2013

 Output needed:

 100 1 2-oct-2013

 101 2 1-oct-2013

 102 2 2-oct-2013

 I am using Hive 0.11

 Any suggestions please ?

 Regards,

 Raj

Re: Remove duplicate records in Hive

Posted by Nishant Kelkar <ni...@gmail.com>.

Hi Raj,

You can do something along these lines:

SELECT cno, sqno, SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date FROM
table GROUP BY cno, sqno;

However, you have to make sure your date format is such that sorting it
gives you the most recent date. The best way to do that is to have it in
format: YYYY-MM-DD.

Hope this helps.

Best Regards,
Nishant Kelkar

On Wed, Sep 10, 2014 at 10:04 AM, Raj Hadoop <ha...@yahoo.com> wrote:

>
> Hi,
>
> I have a requirement in Hive to remove duplicate records ( they differ
> only by one column i.e a date column) and keep the latest date record.
>
> Sample :
> Hive Table :
>  d2 is a higher
> cno,sqno,date
>
> 100 1 1-oct-2013
> 101 2 1-oct-2013
> 100 1 2-oct-2013
> 102 2 2-oct-2013
>
>
> Output needed:
>
> 100 1 2-oct-2013
> 101 2 1-oct-2013
> 102 2 2-oct-2013
>
> I am using Hive 0.11
>
> Any suggestions please ?
>
> Regards,
> Raj
>

Re: Remove duplicate records in Hive

Posted by Kevin Weiler <Ke...@imc-chicago.com>.

If you can just query the table for your results, you can do a SELECT DISTINCT instead of just a SELECT. If you give me a bit more information about where the duplicate data is coming from, I can provide a bit more detail. You can come see me on the end of desk.

--
Kevin Weiler
IT
IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.weiler@imc-chicago.com<ma...@imc-chicago.com>

On Sep 10, 2014, at 12:04 PM, Raj Hadoop <ha...@yahoo.com>> wrote:


Hi,

I have a requirement in Hive to remove duplicate records ( they differ only by one column i.e a date column) and keep the latest date record.

Sample :
Hive Table :
d2 is a higher
cno,sqno,date

100 1 1-oct-2013
101 2 1-oct-2013
100 1 2-oct-2013
102 2 2-oct-2013


Output needed:

100 1 2-oct-2013
101 2 1-oct-2013
102 2 2-oct-2013

I am using Hive 0.11

Any suggestions please ?

Regards,
Raj


________________________________

The information in this e-mail is intended only for the person or entity to which it is addressed.

It may contain confidential and /or privileged material. If someone other than the intended recipient should receive this e-mail, he / she shall not be entitled to read, disseminate, disclose or duplicate it.

If you receive this e-mail unintentionally, please inform us immediately by "reply" and then delete it from your system. Although this information has been compiled with great care, neither IMC Financial Markets & Asset Management nor any of its related entities shall accept any responsibility for any errors, omissions or other inaccuracies in this information or for the consequences thereof, nor shall it be bound in any way by the contents of this e-mail or its attachments. In the event of incomplete or incorrect transmission, please return the e-mail to the sender and permanently delete this message and any attachments.

Messages and attachments are scanned for all known viruses. Always scan attachments before opening them.