You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Ivan Kruglov <iv...@booking.com> on 2013/09/23 16:48:34 UTC

Creating partition with __HIVE_DEFAULT_PARTITION__ value

Hello to everyone,

I'm working on the task of syncing data between two tables which have similar structure (read the same set of partitions). The tables are in different data centers and one table is a backup copy of another one. I'm trying to achieve this goal through distcp-ing data into target DC in temporary folder, recreating all needed partitions in target table and moving files from temporary place to final place. But I'm stuck on issue of creating partitions with value ' __HIVE_DEFAULT_PARTITION__'

So, my question is: Is it possible in hive to manually create partition with '__HIVE_DEFAULT_PARTITION__' value?

Neither of this way work: 
ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition=);
ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition='');
ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition='__HIVE_DEFAULT_PARTITION__');

Thank you.
Ivan Kruglov.

Re: Creating partition with __HIVE_DEFAULT_PARTITION__ value

Posted by Nitin Pawar <ni...@gmail.com>.
If  I understand correctly, this is what you are trying to do.

you have a data center where the data is written on a hive table (Data
center A)
you have another data center where you want to take backup of table from
data center A

you are using dist-cp to transfer data from data center A to B

if you are replicating data in exact same directories as they are in source
data center, easy option would be setting up meta store replication.




On Mon, Sep 23, 2013 at 8:18 PM, Ivan Kruglov <iv...@booking.com>wrote:

> Hello to everyone,
>
> I'm working on the task of syncing data between two tables which have
> similar structure (read the same set of partitions). The tables are in
> different data centers and one table is a backup copy of another one. I'm
> trying to achieve this goal through distcp-ing data into target DC in
> temporary folder, recreating all needed partitions in target table and
> moving files from temporary place to final place. But I'm stuck on issue of
> creating partitions with value ' __HIVE_DEFAULT_PARTITION__'
>
> So, my question is: Is it possible in hive to manually create partition
> with '__HIVE_DEFAULT_PARTITION__' value?
>
> Neither of this way work:
> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition=);
> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition='');
> ALTER TABLE table ADD IF NOT EXISTS PARTITION
> (partition='__HIVE_DEFAULT_PARTITION__');
>
> Thank you.
> Ivan Kruglov.




-- 
Nitin Pawar

Re: Creating partition with __HIVE_DEFAULT_PARTITION__ value

Posted by Stephen Sprague <sp...@gmail.com>.
so msck repair table + dynamic partitioning semantics looks like it fits
the bill for you.

yeah, 300K partitions. That's getting up there on the scale of things with
hive i'd say and close to over-partitioning.   for archival purposes maybe
older data doesn't need such a fine grained partition?  something to think
about anyway.

anyhoo, glad you found a solution!



On Tue, Sep 24, 2013 at 3:07 AM, Ivan Kruglov <iv...@booking.com>wrote:

> Hi everyone,
>
> Thank you for your answers.
>
> On 24.09.2013, at 0:36, Stephen Sprague <sp...@gmail.com> wrote:
>
> If its any help I've done this kind of thing frequently:
>
> 1. create the table on the new cluster.
>
> 2. distcp the data right into the hdfs directory where the table resides
> on the new cluster - no temp storage required.
>
> 3. run this hive command:   msck repair table <table>;   -- this command
> will create your partitions for you - its pretty slick that way.
>
>
> Let us know how it goes.
>
>
> I was thinking about this case. But it doesn't work well for me. The thing
> is that the table is giant. It's about 200TB and has about 300K partitions.
> So, MSCK REPAIR TABLE takes forever to compete. And I would need to run it
> every day (I'm doing incremental-like distcp-ing). However, I can go
> another way. I can distcp data into temporary table which has small amount
> of partitions, run MSCK against it and then do something like "INSERT
> OVERWRITE TABLE target_table PARTITION(….) SELECT * FROM tmp_table". It
> should work.
>
>
>
> On Mon, Sep 23, 2013 at 10:46 AM, Edward Capriolo <ed...@gmail.com>wrote:
>
>> Did you try ALTER TABLE table ADD IF NOT EXISTS PARTITION
>> (partition=NULL);
>>
>> If that does not work you will need to create a dynamic partition type
>> query that will create the dummy partition. File a jira if the above syntax
>> does not work. There should be SOME way to create the default partition by
>> hand.
>>
>
> Yes, I've tried it and it doesn't work as well. I will file a ticket.
>
>
>>
>> On Mon, Sep 23, 2013 at 10:48 AM, Ivan Kruglov <iv...@booking.com>wrote:
>>
>>> Hello to everyone,
>>>
>>> I'm working on the task of syncing data between two tables which have
>>> similar structure (read the same set of partitions). The tables are in
>>> different data centers and one table is a backup copy of another one. I'm
>>> trying to achieve this goal through distcp-ing data into target DC in
>>> temporary folder, recreating all needed partitions in target table and
>>> moving files from temporary place to final place. But I'm stuck on issue of
>>> creating partitions with value ' __HIVE_DEFAULT_PARTITION__'
>>>
>>> So, my question is: Is it possible in hive to manually create partition
>>> with '__HIVE_DEFAULT_PARTITION__' value?
>>>
>>> Neither of this way work:
>>> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition=);
>>> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition='');
>>> ALTER TABLE table ADD IF NOT EXISTS PARTITION
>>> (partition='__HIVE_DEFAULT_PARTITION__');
>>>
>>> Thank you.
>>> Ivan Kruglov.
>>
>>
>>
>
>

Re: Creating partition with __HIVE_DEFAULT_PARTITION__ value

Posted by Ivan Kruglov <iv...@booking.com>.
Hi everyone,

Thank you for your answers.

On 24.09.2013, at 0:36, Stephen Sprague <sp...@gmail.com> wrote:

> If its any help I've done this kind of thing frequently:
>   
> 1. create the table on the new cluster.
> 
> 2. distcp the data right into the hdfs directory where the table resides on the new cluster - no temp storage required.
> 
> 3. run this hive command:   msck repair table <table>;   -- this command will create your partitions for you - its pretty slick that way.
> 
> 
> Let us know how it goes.

I was thinking about this case. But it doesn't work well for me. The thing is that the table is giant. It's about 200TB and has about 300K partitions. So, MSCK REPAIR TABLE takes forever to compete. And I would need to run it every day (I'm doing incremental-like distcp-ing). However, I can go another way. I can distcp data into temporary table which has small amount of partitions, run MSCK against it and then do something like "INSERT OVERWRITE TABLE target_table PARTITION(….) SELECT * FROM tmp_table". It should work.

> 
> 
> On Mon, Sep 23, 2013 at 10:46 AM, Edward Capriolo <ed...@gmail.com> wrote:
> Did you try ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition=NULL);
> 
> If that does not work you will need to create a dynamic partition type query that will create the dummy partition. File a jira if the above syntax does not work. There should be SOME way to create the default partition by hand.

Yes, I've tried it and it doesn't work as well. I will file a ticket.

> 
> 
> On Mon, Sep 23, 2013 at 10:48 AM, Ivan Kruglov <iv...@booking.com> wrote:
> Hello to everyone,
> 
> I'm working on the task of syncing data between two tables which have similar structure (read the same set of partitions). The tables are in different data centers and one table is a backup copy of another one. I'm trying to achieve this goal through distcp-ing data into target DC in temporary folder, recreating all needed partitions in target table and moving files from temporary place to final place. But I'm stuck on issue of creating partitions with value ' __HIVE_DEFAULT_PARTITION__'
> 
> So, my question is: Is it possible in hive to manually create partition with '__HIVE_DEFAULT_PARTITION__' value?
> 
> Neither of this way work:
> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition=);
> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition='');
> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition='__HIVE_DEFAULT_PARTITION__');
> 
> Thank you.
> Ivan Kruglov.
> 
> 


Re: Creating partition with __HIVE_DEFAULT_PARTITION__ value

Posted by Stephen Sprague <sp...@gmail.com>.
If its any help I've done this kind of thing frequently:

1. create the table on the new cluster.

2. distcp the data right into the hdfs directory where the table resides on
the new cluster - no temp storage required.

3. run this hive command:   msck repair table <table>;   -- this command
will create your partitions for you - its pretty slick that way.


Let us know how it goes.




On Mon, Sep 23, 2013 at 10:46 AM, Edward Capriolo <ed...@gmail.com>wrote:

> Did you try ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition=NULL);
>
> If that does not work you will need to create a dynamic partition type
> query that will create the dummy partition. File a jira if the above syntax
> does not work. There should be SOME way to create the default partition by
> hand.
>
>
> On Mon, Sep 23, 2013 at 10:48 AM, Ivan Kruglov <iv...@booking.com>wrote:
>
>> Hello to everyone,
>>
>> I'm working on the task of syncing data between two tables which have
>> similar structure (read the same set of partitions). The tables are in
>> different data centers and one table is a backup copy of another one. I'm
>> trying to achieve this goal through distcp-ing data into target DC in
>> temporary folder, recreating all needed partitions in target table and
>> moving files from temporary place to final place. But I'm stuck on issue of
>> creating partitions with value ' __HIVE_DEFAULT_PARTITION__'
>>
>> So, my question is: Is it possible in hive to manually create partition
>> with '__HIVE_DEFAULT_PARTITION__' value?
>>
>> Neither of this way work:
>> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition=);
>> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition='');
>> ALTER TABLE table ADD IF NOT EXISTS PARTITION
>> (partition='__HIVE_DEFAULT_PARTITION__');
>>
>> Thank you.
>> Ivan Kruglov.
>
>
>

Re: Creating partition with __HIVE_DEFAULT_PARTITION__ value

Posted by Edward Capriolo <ed...@gmail.com>.
Did you try ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition=NULL);

If that does not work you will need to create a dynamic partition type
query that will create the dummy partition. File a jira if the above syntax
does not work. There should be SOME way to create the default partition by
hand.


On Mon, Sep 23, 2013 at 10:48 AM, Ivan Kruglov <iv...@booking.com>wrote:

> Hello to everyone,
>
> I'm working on the task of syncing data between two tables which have
> similar structure (read the same set of partitions). The tables are in
> different data centers and one table is a backup copy of another one. I'm
> trying to achieve this goal through distcp-ing data into target DC in
> temporary folder, recreating all needed partitions in target table and
> moving files from temporary place to final place. But I'm stuck on issue of
> creating partitions with value ' __HIVE_DEFAULT_PARTITION__'
>
> So, my question is: Is it possible in hive to manually create partition
> with '__HIVE_DEFAULT_PARTITION__' value?
>
> Neither of this way work:
> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition=);
> ALTER TABLE table ADD IF NOT EXISTS PARTITION (partition='');
> ALTER TABLE table ADD IF NOT EXISTS PARTITION
> (partition='__HIVE_DEFAULT_PARTITION__');
>
> Thank you.
> Ivan Kruglov.