Posted to user@hive.apache.org by Philippe Kernévez <pk...@octo.com> on 2015/02/17 18:31:17 UTC

Remove duplicated rows

Hi,

I have a table (named DEDUPLICATED) that contains about 1 billion rows.
Each day I receive a new file with about 5 million rows. About 10% of
those rows are duplicates (duplication occurs inside a daily file but also
across files).
There are about 30 fields in the files.

As of now, I deduplicate all the data every day with the following query:

  INSERT OVERWRITE TABLE DEDUPLICATED
    SELECT cl.*
    FROM (
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
        FROM DAILY d  -- DAILY is an external table that contains all the daily files
        ) cl
    WHERE cl.pos = 1

On the mailing list I saw another approach, based on a "GROUP BY KEY" query
that uses SELECT MAX(xxx) for every non-key field.
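If I understand it correctly, it would look roughly like this (just a sketch;
only two of the ~30 fields are shown, and field1/field2 are made-up names):

  -- Sketch of the GROUP BY variant (field1, field2 are placeholder column names)
  INSERT OVERWRITE TABLE DEDUPLICATED
    SELECT d.KEY,
           MAX(d.field1) AS field1,
           MAX(d.field2) AS field2
           -- ... and so on for every remaining non-key field
    FROM DAILY d
    GROUP BY d.KEY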

My first question is: which of the two seems better?
(the second one is harder to maintain, as all the fields have to be
written explicitly in the query).



The second question is: what is the best way to do the deduplication and
the import with an incremental approach?
Something like this?
  INSERT INTO TABLE DEDUPLICATED
    SELECT cl.*
    FROM (
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
        FROM LAST_DAILY_FILE d     -- ONLY the last file
        ) cl
    WHERE cl.pos = 1    -- REQUIRED to remove the duplication inside the last file
      AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED)  -- remove duplication between the last file and the existing rows
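
If the NOT IN subquery turns out to be a problem at this scale, I suppose the
same filter could also be written as an anti-join (just a sketch, not tested):

  -- Sketch: the same incremental filter expressed as a left outer join (untested)
  INSERT INTO TABLE DEDUPLICATED
    SELECT cl.*
    FROM (
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
        FROM LAST_DAILY_FILE d     -- ONLY the last file
        ) cl
    LEFT OUTER JOIN DEDUPLICATED old ON (cl.KEY = old.KEY)
    WHERE cl.pos = 1          -- duplicates inside the last file
      AND old.KEY IS NULL     -- duplicates against the rows already loaded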

And the last question: for that last query, does an index on KEY help
with Hive the way it can on a classical relational database?
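
For reference, the kind of index I have in mind would be created with
something like this (a sketch; I have not benchmarked it):

  -- Sketch: a compact index on KEY (index name is made up, not benchmarked)
  CREATE INDEX deduplicated_key_idx
    ON TABLE DEDUPLICATED (KEY)
    AS 'COMPACT' WITH DEFERRED REBUILD;
  ALTER INDEX deduplicated_key_idx ON DEDUPLICATED REBUILD;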

Regards,
Philippe



-- 
Philippe Kernévez



Directeur technique (Suisse),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com

Re: Remove duplicated rows

Posted by Philippe Kernévez <pk...@octo.com>.
Hi Dev, and thank you for your help.

On Wed, Feb 18, 2015 at 11:31 AM, Devopam Mittra <de...@gmail.com> wrote:

> hi Philippe,
> Performance improvement has two factors : 1. availability (read abundance)
> of resources 2.need for speed
> All "advise" usually is to address mainly these two factors , as I have
> usually seen till this far.
>
> Since you are doing a full scan each day for identify dedups I suggested
> keeping data uncompressed (in fact you may do the same only with a separate
> table with key columns itself)
>

Is there a difference, in terms of performance, between a separate table with
only the key columns and an index?


>
> Apologies that the post didn't come out clear in terms of scalability and
> you are definitely the best judge of the fit-gap analysis , being closest
> to the issues. Dynamic partitions allows selective target for the table and
> not a FTS so I presumed that scaling up should never be an issue ...
>
> Couldn't be of much help here I guess :(
>

Thanks a lot for your contribution.
Philippe


> regards
> Dev
>
>
> On Wed, Feb 18, 2015 at 3:02 PM, Philippe Kernévez <pk...@octo.com>
> wrote:
>
>> Hi Dev,
>>
>> > "Only suggestion is that please don't compress the files that are
>> linked in the external table" .
>> Why do you suggest that ? Presently, I divide my import time by 2 by
>> using 'gz' files instead of the uncompressed 'csv'.
>>
>> > "You may want to leverage dynamic partition"
>> I read you post, and I wonder if it can scale. It seems that you have
>> hourly and daily aggregation. More or less 24*365=8'800 raw per year.
>> We have about 2 billions (2E9) raws per year.
>> Do you know if you approach can scale on a such factor ?
>> I also wonder about the impact of the partition columns on the map reduce
>> job, they will probably produce a big number of files to produce those
>> partitions.
>>
>>
>> Regards,
>> Philippe
>>
>>
>>
>> On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra <de...@gmail.com>
>> wrote:
>>
>>> dear Philippe,
>>> I asked specifically the version so as to understand what built-in
>>> features you can leverage to your benefit.
>>>
>>> There are two ways that I can think of - to help you in your case :
>>> 1. Use RANK as you have already noted in your original note - this will
>>> be faster and more appropriate . Only suggestion is that please don't
>>> compress the files that are linked in the external table.
>>>
>>> 2. You may want to leverage dynamic partition feature to your aid for
>>> managing the duplicate records . I like it and use it extensively now a
>>> days after 0.13 onward. Brief concept : dynamically partition the table on
>>> one low skew column + 'key' column . Then simply INSERT OVERWRITE the
>>> 'delta' and it will seamlessly overwrite only the affected rows and do the
>>> redistribution of data in the table internally without you having to bother
>>> about the code or the burden to do it. You may refer to this blog I wrote
>>> quite some time back : http://linkd.in/1Fq3wdb
>>> This technique will cost you a little overhead time with the MR job
>>> getting kicked off and all , but is painless and seamless , so I like it
>>> this way.
>>>
>>> regards
>>> Dev
>>>
>>> On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pk...@octo.com>
>>> wrote:
>>>
>>>> Hi Dev,
>>>>
>>>> I'm using hive 0.14.
>>>>
>>>> Regards,
>>>>
>>>> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <de...@gmail.com>
>>>> wrote:
>>>>
>>>>> Can you please state the hive version
>>>>>
>>>>> regards
>>>>> Dev
>>>>> +91 958 305 9899
>>>>>
>>>>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pk...@octo.com>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a table (named DEDUPLICATED) that contains about 1 billions
>>>>> rows.
>>>>> Each day I receive a new file  with about 5 millions rows. About 10%
>>>>> of those rows are duplicated (duplication occur inside a daily file but
>>>>> also between files).
>>>>> There are about 30 fields in the files.
>>>>>
>>>>> As for now I deduplicate all the data every day with the following
>>>>> request :
>>>>>
>>>>>   INSERT OVERWRITE TABLE DEDUPLICATED
>>>>>     SELECT cl.*
>>>>>     FROM (
>>>>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>>         FROM DAILY d  -- DAILY is an external table that contains all
>>>>> the daily files
>>>>>         ) cl
>>>>>     WHERE cl.pos = 1
>>>>>
>>>>> On the mailing list I saw another approach base on a "group by KEY"
>>>>> request and use a 'select MAX(xxx)' for all non-key fields.
>>>>>
>>>>> My first question is : which of the both seems to be better ?
>>>>> (the second one is quite harder to maintain as all the fields should
>>>>> be explicitly written in the request).
>>>>>
>>>>>
>>>>>
>>>>> The second question is : what is the best way to do the deduplication
>>>>> and import on a incremental approach ?
>>>>> Something like that ?
>>>>>   INSERT TABLE DEDUPLICATED
>>>>>     SELECT cl.*
>>>>>     FROM (
>>>>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>>         FROM LAST_DAILY_FILE d     -- ONLY the last file
>>>>>         ) cl
>>>>>     WHERE cl.pos = 1    -- REQUIRED to remove all the duplication
>>>>> inside the last file
>>>>>    AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove
>>>>> duplication between the last file and all the existing files
>>>>>
>>>>> And the last question : for the last request, does an index on KEY
>>>>> help with hive as it can help on a classical relational database ?
>>>>>
>>>>> Regards,
>>>>> Philippe
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Philippe Kernévez
>>>>>
>>>>>
>>>>>
>>>>> Directeur technique (Suisse),
>>>>> pkernevez@octo.com
>>>>> +41 79 888 33 32
>>>>>
>>>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>>>> OCTO Technology http://www.octo.com
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Philippe Kernévez
>>>>
>>>>
>>>>
>>>> Directeur technique (Suisse),
>>>> pkernevez@octo.com
>>>> +41 79 888 33 32
>>>>
>>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>>> OCTO Technology http://www.octo.com
>>>>
>>>
>>>
>>>
>>> --
>>> Devopam Mittra
>>> Life and Relations are not binary
>>>
>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Directeur technique (Suisse),
>> pkernevez@octo.com
>> +41 79 888 33 32
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.com
>>
>
>
>
> --
> Devopam Mittra
> Life and Relations are not binary
>



-- 
Philippe Kernévez



Directeur technique (Suisse),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com

Re: Remove duplicated rows

Posted by Devopam Mittra <de...@gmail.com>.
hi Philippe,
Performance improvement has two factors: 1. availability (read: abundance)
of resources, and 2. the need for speed.
All advice is usually aimed mainly at these two factors, as far as I have
seen so far.

Since you are doing a full scan each day to identify duplicates, I suggested
keeping the data uncompressed (in fact you could do the same with a separate
table holding only the key columns).
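Roughly something like this (just a sketch; the table name is made up):

  -- Sketch: a side table that keeps only the keys already loaded (name is made up)
  CREATE TABLE DEDUPLICATED_KEYS STORED AS ORC AS
  SELECT DISTINCT KEY FROM DEDUPLICATED;

The daily NOT IN / anti-join check can then scan this small table instead of
the full 30-column table.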

Apologies that the post didn't come out clear in terms of scalability; you
are definitely the best judge of the fit-gap analysis, being closest
to the issues. Dynamic partitions let you target only selected partitions of
the table rather than do a full table scan, so I presumed that scaling up
should never be an issue ...

Couldn't be of much help here I guess :(
regards
Dev


On Wed, Feb 18, 2015 at 3:02 PM, Philippe Kernévez <pk...@octo.com>
wrote:

> Hi Dev,
>
> > "Only suggestion is that please don't compress the files that are
> linked in the external table" .
> Why do you suggest that ? Presently, I divide my import time by 2 by using
> 'gz' files instead of the uncompressed 'csv'.
>
> > "You may want to leverage dynamic partition"
> I read you post, and I wonder if it can scale. It seems that you have
> hourly and daily aggregation. More or less 24*365=8'800 raw per year.
> We have about 2 billions (2E9) raws per year.
> Do you know if you approach can scale on a such factor ?
> I also wonder about the impact of the partition columns on the map reduce
> job, they will probably produce a big number of files to produce those
> partitions.
>
>
> Regards,
> Philippe
>
>
>
> On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra <de...@gmail.com> wrote:
>
>> dear Philippe,
>> I asked specifically the version so as to understand what built-in
>> features you can leverage to your benefit.
>>
>> There are two ways that I can think of - to help you in your case :
>> 1. Use RANK as you have already noted in your original note - this will
>> be faster and more appropriate . Only suggestion is that please don't
>> compress the files that are linked in the external table.
>>
>> 2. You may want to leverage dynamic partition feature to your aid for
>> managing the duplicate records . I like it and use it extensively now a
>> days after 0.13 onward. Brief concept : dynamically partition the table on
>> one low skew column + 'key' column . Then simply INSERT OVERWRITE the
>> 'delta' and it will seamlessly overwrite only the affected rows and do the
>> redistribution of data in the table internally without you having to bother
>> about the code or the burden to do it. You may refer to this blog I wrote
>> quite some time back : http://linkd.in/1Fq3wdb
>> This technique will cost you a little overhead time with the MR job
>> getting kicked off and all , but is painless and seamless , so I like it
>> this way.
>>
>> regards
>> Dev
>>
>> On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pk...@octo.com>
>> wrote:
>>
>>> Hi Dev,
>>>
>>> I'm using hive 0.14.
>>>
>>> Regards,
>>>
>>> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <de...@gmail.com>
>>> wrote:
>>>
>>>> Can you please state the hive version
>>>>
>>>> regards
>>>> Dev
>>>> +91 958 305 9899
>>>>
>>>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pk...@octo.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
>>>> Each day I receive a new file  with about 5 millions rows. About 10% of
>>>> those rows are duplicated (duplication occur inside a daily file but also
>>>> between files).
>>>> There are about 30 fields in the files.
>>>>
>>>> As for now I deduplicate all the data every day with the following
>>>> request :
>>>>
>>>>   INSERT OVERWRITE TABLE DEDUPLICATED
>>>>     SELECT cl.*
>>>>     FROM (
>>>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>         FROM DAILY d  -- DAILY is an external table that contains all
>>>> the daily files
>>>>         ) cl
>>>>     WHERE cl.pos = 1
>>>>
>>>> On the mailing list I saw another approach base on a "group by KEY"
>>>> request and use a 'select MAX(xxx)' for all non-key fields.
>>>>
>>>> My first question is : which of the both seems to be better ?
>>>> (the second one is quite harder to maintain as all the fields should be
>>>> explicitly written in the request).
>>>>
>>>>
>>>>
>>>> The second question is : what is the best way to do the deduplication
>>>> and import on a incremental approach ?
>>>> Something like that ?
>>>>   INSERT TABLE DEDUPLICATED
>>>>     SELECT cl.*
>>>>     FROM (
>>>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>         FROM LAST_DAILY_FILE d     -- ONLY the last file
>>>>         ) cl
>>>>     WHERE cl.pos = 1    -- REQUIRED to remove all the duplication
>>>> inside the last file
>>>>    AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove
>>>> duplication between the last file and all the existing files
>>>>
>>>> And the last question : for the last request, does an index on KEY help
>>>> with hive as it can help on a classical relational database ?
>>>>
>>>> Regards,
>>>> Philippe
>>>>
>>>>
>>>>
>>>> --
>>>> Philippe Kernévez
>>>>
>>>>
>>>>
>>>> Directeur technique (Suisse),
>>>> pkernevez@octo.com
>>>> +41 79 888 33 32
>>>>
>>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>>> OCTO Technology http://www.octo.com
>>>>
>>>>
>>>
>>>
>>> --
>>> Philippe Kernévez
>>>
>>>
>>>
>>> Directeur technique (Suisse),
>>> pkernevez@octo.com
>>> +41 79 888 33 32
>>>
>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>> OCTO Technology http://www.octo.com
>>>
>>
>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkernevez@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com
>



-- 
Devopam Mittra
Life and Relations are not binary

Re: Remove duplicated rows

Posted by Philippe Kernévez <pk...@octo.com>.
Hi Dev,

> "Only suggestion is that please don't compress the files that are linked
in the external table" .
Why do you suggest that ? Presently, I divide my import time by 2 by using
'gz' files instead of the uncompressed 'csv'.

> "You may want to leverage dynamic partition"
I read you post, and I wonder if it can scale. It seems that you have
hourly and daily aggregation. More or less 24*365=8'800 raw per year.
We have about 2 billions (2E9) raws per year.
Do you know if you approach can scale on a such factor ?
I also wonder about the impact of the partition columns on the map reduce
job, they will probably produce a big number of files to produce those
partitions.


Regards,
Philippe



On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra <de...@gmail.com> wrote:

> dear Philippe,
> I asked specifically the version so as to understand what built-in
> features you can leverage to your benefit.
>
> There are two ways that I can think of - to help you in your case :
> 1. Use RANK as you have already noted in your original note - this will be
> faster and more appropriate . Only suggestion is that please don't compress
> the files that are linked in the external table.
>
> 2. You may want to leverage dynamic partition feature to your aid for
> managing the duplicate records . I like it and use it extensively now a
> days after 0.13 onward. Brief concept : dynamically partition the table on
> one low skew column + 'key' column . Then simply INSERT OVERWRITE the
> 'delta' and it will seamlessly overwrite only the affected rows and do the
> redistribution of data in the table internally without you having to bother
> about the code or the burden to do it. You may refer to this blog I wrote
> quite some time back : http://linkd.in/1Fq3wdb
> This technique will cost you a little overhead time with the MR job
> getting kicked off and all , but is painless and seamless , so I like it
> this way.
>
> regards
> Dev
>
> On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pk...@octo.com>
> wrote:
>
>> Hi Dev,
>>
>> I'm using hive 0.14.
>>
>> Regards,
>>
>> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <de...@gmail.com>
>> wrote:
>>
>>> Can you please state the hive version
>>>
>>> regards
>>> Dev
>>> +91 958 305 9899
>>>
>>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pk...@octo.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
>>> Each day I receive a new file  with about 5 millions rows. About 10% of
>>> those rows are duplicated (duplication occur inside a daily file but also
>>> between files).
>>> There are about 30 fields in the files.
>>>
>>> As for now I deduplicate all the data every day with the following
>>> request :
>>>
>>>   INSERT OVERWRITE TABLE DEDUPLICATED
>>>     SELECT cl.*
>>>     FROM (
>>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>         FROM DAILY d  -- DAILY is an external table that contains all
>>> the daily files
>>>         ) cl
>>>     WHERE cl.pos = 1
>>>
>>> On the mailing list I saw another approach base on a "group by KEY"
>>> request and use a 'select MAX(xxx)' for all non-key fields.
>>>
>>> My first question is : which of the both seems to be better ?
>>> (the second one is quite harder to maintain as all the fields should be
>>> explicitly written in the request).
>>>
>>>
>>>
>>> The second question is : what is the best way to do the deduplication
>>> and import on a incremental approach ?
>>> Something like that ?
>>>   INSERT TABLE DEDUPLICATED
>>>     SELECT cl.*
>>>     FROM (
>>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>         FROM LAST_DAILY_FILE d     -- ONLY the last file
>>>         ) cl
>>>     WHERE cl.pos = 1    -- REQUIRED to remove all the duplication inside
>>> the last file
>>>    AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove duplication
>>> between the last file and all the existing files
>>>
>>> And the last question : for the last request, does an index on KEY help
>>> with hive as it can help on a classical relational database ?
>>>
>>> Regards,
>>> Philippe
>>>
>>>
>>>
>>> --
>>> Philippe Kernévez
>>>
>>>
>>>
>>> Directeur technique (Suisse),
>>> pkernevez@octo.com
>>> +41 79 888 33 32
>>>
>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>> OCTO Technology http://www.octo.com
>>>
>>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Directeur technique (Suisse),
>> pkernevez@octo.com
>> +41 79 888 33 32
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.com
>>
>
>
>
> --
> Devopam Mittra
> Life and Relations are not binary
>



-- 
Philippe Kernévez



Directeur technique (Suisse),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com

Re: Remove duplicated rows

Posted by Devopam Mittra <de...@gmail.com>.
dear Philippe,
I asked specifically for the version so as to understand which built-in features
you can leverage to your benefit.

There are two ways I can think of to help in your case:
1. Use RANK as you have already noted in your original note - this will be
faster and more appropriate. Only suggestion: please don't compress
the files that are linked in the external table.

2. You may want to leverage the dynamic partition feature to manage
the duplicate records. I like it and use it extensively nowadays,
from 0.13 onward. Brief concept: dynamically partition the table on
one low-skew column + the 'key' column. Then simply INSERT OVERWRITE the
'delta', and it will seamlessly overwrite only the affected partitions and
redistribute the data in the table internally, without you having to bother
about the code or the burden of doing it. You may refer to this blog I wrote
quite some time back: http://linkd.in/1Fq3wdb
This technique will cost you a little overhead time with the MR job getting
kicked off and all, but it is painless and seamless, so I like it this way.
A rough sketch of the mechanism follows below.
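
A minimal sketch of the idea (the partition column load_month and the data
columns field1/field2 are made-up examples, not from the blog; adjust them to
your own low-skew column and key):

  -- Enable dynamic partitioning (partition/column names below are made up)
  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;

  -- Only the partitions that actually appear in the delta are rewritten;
  -- all other partitions of DEDUPLICATED stay untouched.
  INSERT OVERWRITE TABLE DEDUPLICATED PARTITION (load_month)
  SELECT d.KEY,
         d.field1,
         d.field2,        -- ... the remaining non-key fields
         d.load_month     -- the dynamic partition column must come last
  FROM LAST_DAILY_FILE d;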

regards
Dev

On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pk...@octo.com>
wrote:

> Hi Dev,
>
> I'm using hive 0.14.
>
> Regards,
>
> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <de...@gmail.com> wrote:
>
>> Can you please state the hive version
>>
>> regards
>> Dev
>> +91 958 305 9899
>>
>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pk...@octo.com>
>> wrote:
>>
>> Hi,
>>
>> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
>> Each day I receive a new file  with about 5 millions rows. About 10% of
>> those rows are duplicated (duplication occur inside a daily file but also
>> between files).
>> There are about 30 fields in the files.
>>
>> As for now I deduplicate all the data every day with the following
>> request :
>>
>>   INSERT OVERWRITE TABLE DEDUPLICATED
>>     SELECT cl.*
>>     FROM (
>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>         FROM DAILY d  -- DAILY is an external table that contains all the
>> daily files
>>         ) cl
>>     WHERE cl.pos = 1
>>
>> On the mailing list I saw another approach base on a "group by KEY"
>> request and use a 'select MAX(xxx)' for all non-key fields.
>>
>> My first question is : which of the both seems to be better ?
>> (the second one is quite harder to maintain as all the fields should be
>> explicitly written in the request).
>>
>>
>>
>> The second question is : what is the best way to do the deduplication and
>> import on a incremental approach ?
>> Something like that ?
>>   INSERT TABLE DEDUPLICATED
>>     SELECT cl.*
>>     FROM (
>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>         FROM LAST_DAILY_FILE d     -- ONLY the last file
>>         ) cl
>>     WHERE cl.pos = 1    -- REQUIRED to remove all the duplication inside
>> the last file
>>    AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove duplication
>> between the last file and all the existing files
>>
>> And the last question : for the last request, does an index on KEY help
>> with hive as it can help on a classical relational database ?
>>
>> Regards,
>> Philippe
>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Directeur technique (Suisse),
>> pkernevez@octo.com
>> +41 79 888 33 32
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.com
>>
>>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkernevez@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com
>



-- 
Devopam Mittra
Life and Relations are not binary

Re: Remove duplicated rows

Posted by Philippe Kernévez <pk...@octo.com>.
Hi Dev,

I'm using hive 0.14.

Regards,

On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <de...@gmail.com> wrote:

> Can you please state the hive version
>
> regards
> Dev
> +91 958 305 9899
>
> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pk...@octo.com>
> wrote:
>
> Hi,
>
> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
> Each day I receive a new file  with about 5 millions rows. About 10% of
> those rows are duplicated (duplication occur inside a daily file but also
> between files).
> There are about 30 fields in the files.
>
> As for now I deduplicate all the data every day with the following request
> :
>
>   INSERT OVERWRITE TABLE DEDUPLICATED
>     SELECT cl.*
>     FROM (
>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>         FROM DAILY d  -- DAILY is an external table that contains all the
> daily files
>         ) cl
>     WHERE cl.pos = 1
>
> On the mailing list I saw another approach base on a "group by KEY"
> request and use a 'select MAX(xxx)' for all non-key fields.
>
> My first question is : which of the both seems to be better ?
> (the second one is quite harder to maintain as all the fields should be
> explicitly written in the request).
>
>
>
> The second question is : what is the best way to do the deduplication and
> import on a incremental approach ?
> Something like that ?
>   INSERT TABLE DEDUPLICATED
>     SELECT cl.*
>     FROM (
>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>         FROM LAST_DAILY_FILE d     -- ONLY the last file
>         ) cl
>     WHERE cl.pos = 1    -- REQUIRED to remove all the duplication inside
> the last file
>    AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove duplication
> between the last file and all the existing files
>
> And the last question : for the last request, does an index on KEY help
> with hive as it can help on a classical relational database ?
>
> Regards,
> Philippe
>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkernevez@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com
>
>


-- 
Philippe Kernévez



Directeur technique (Suisse),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com

Re: Remove duplicated rows

Posted by Devopam Mittra <de...@gmail.com>.
Can you please state the hive version 

regards
Dev
+91 958 305 9899

> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pk...@octo.com> wrote:
> 
> Hi,
> 
> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
> Each day I receive a new file  with about 5 millions rows. About 10% of those rows are duplicated (duplication occur inside a daily file but also between files).
> There are about 30 fields in the files.
> 
> As for now I deduplicate all the data every day with the following request :
> 
>   INSERT OVERWRITE TABLE DEDUPLICATED
>     SELECT cl.* 
>     FROM (
>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos 
>         FROM DAILY d  -- DAILY is an external table that contains all the daily files
>         ) cl
>     WHERE cl.pos = 1    
>  
> On the mailing list I saw another approach base on a "group by KEY" request and use a 'select MAX(xxx)' for all non-key fields.
> 
> My first question is : which of the both seems to be better ?
> (the second one is quite harder to maintain as all the fields should be explicitly written in the request).
> 
> 
> 
> The second question is : what is the best way to do the deduplication and import on a incremental approach ?
> Something like that ?
>   INSERT TABLE DEDUPLICATED
>     SELECT cl.* 
>     FROM (
>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos 
>         FROM LAST_DAILY_FILE d     -- ONLY the last file
>         ) cl
>     WHERE cl.pos = 1    -- REQUIRED to remove all the duplication inside the last file
>    AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove duplication between the last file and all the existing files
> 
> And the last question : for the last request, does an index on KEY help with hive as it can help on a classical relational database ?
> 
> Regards,
> Philippe
> 
> 
> 
> -- 
> Philippe Kernévez
> 
> 
> 
> Directeur technique (Suisse), 
> pkernevez@octo.com
> +41 79 888 33 32
> 
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com