You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by raj hive <ra...@gmail.com> on 2016/06/06 03:02:17 UTC

alter partitions on hive external table

Hi friends,

I have created partitions on hive external tables. partitions on
datetime/userid/customerId.

now i have to change the order of the partitions for the existing data for
all the dates.

order of the partition is custerid/userid/datetime.

Anyone can help me, how to alter the partitions for the existing table.
Need a help to write a script to change the partions on existing data.
almost 3 months data is there to modify as per new partition so changing
each date is difficult. Any expert can help me.

Thanks
Raj

Re: alter partitions on hive external table

Posted by raj hive <ra...@gmail.com>.
Thank you so much for the detailed info.

On Tue, Jun 7, 2016 at 1:58 AM, Markovitz, Dudu <dm...@paypal.com>
wrote:

> And here is a full example
>
>
>
>
> ----------------------------------------------------------------------------------------------------
>
> --             bash
>
>
> ----------------------------------------------------------------------------------------------------
>
>
>
> mkdir -p t
>
> mkdir -p t/20150122/dudu/cust1
>
> mkdir -p t/cust2/aaa/2015/01/23/bbb/ccc/dudu/ddd
>
> mkdir -p t/raj/20150204/cust1
>
> mkdir -p t/raj/cust2/yyy/20150204/zzz
>
> mkdir -p t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz
>
>
>
> echo -e "1\n2\n3" > t/20150122/dudu/cust1/data.txt
>
> echo -e "4"       > t/cust2/aaa/2015/01/23/bbb/ccc/dudu/ddd/data.txt
>
> echo -e "5\n6"    > t/raj/20150204/cust1/data.txt
>
> echo -e "7\n8\n9" > t/raj/cust2/yyy/20150204/zzz/data.txt
>
> echo -e "10"      >
> t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz/data.txt
>
>
>
> hdfs dfs -put t /tmp
>
>
>
>
> ----------------------------------------------------------------------------------------------------
>
> --             hive
>
>
> ----------------------------------------------------------------------------------------------------
>
>
>
> n  We’re creating the external table with the requested partition columns
>
>
>
> create external table t (i int) partitioned by (user string,cust string,dt
> date) location '/tmp/t';
>
>
>
> n  We’re choosing each partition values according the full path of the
> relevant directory
>
>
>
> alter table t add partition (user='dudu',cust='cust1',dt=date '2015-01-22')
> location '/tmp/t/20150122/dudu/cust1';
>
> alter table t add partition (user='dudu',cust='cust2',dt=date '2015-01-23')
> location '/tmp/t/cust2/aaa/2015/01/23/bbb/ccc/dudu/ddd';
>
> alter table t add partition (user='raj' ,cust='cust1',dt=date '2015-02-04')
> location '/tmp/t/raj/20150204/cust1';
>
> alter table t add partition (user='raj' ,cust='cust2',dt=date '2015-02-04')
> location '/tmp/t/raj/cust2/yyy/20150204/zzz';
>
> alter table t add partition (user='raj' ,cust='cust3',dt=date '2015-02-04')
> location '/tmp/t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz';
>
>
>
> n  The partitions’ values and their corresponding locations are all saved
> in the metastore
>
> n  The metastore is being queried based on our query predicates.
> Returning the list of relevant partitions/locations
>
>
>
> explain dependency select * from t where (cust like '%1' and dt < date
> '2015-02-01') or (user='raj' and substr(cust,-1) = 3) ;
>
>
>
> {"input_partitions":[{"*partitionName*":"default@t@user=dudu/cust=cust1
> /dt=2015-01-22"},{"*partitionName*":"default@t@user=raj/cust=cust3/dt=
> 2015-02-04"}],"input_tables":[{"tablename":"default@t
> ","tabletype":"EXTERNAL_TABLE"}]}
>
>
>
> select *,input__file__name from t where (cust like '%1' and dt < date
> '2015-02-01') or (user='raj' and substr(cust,-1) = 3) ;
>
>
>
> 1         dudu      cust1     2015-01-22
> hdfs://quickstart.cloudera:8020/tmp/t/20150122/dudu/cust1/data.txt
>
> 2         dudu      cust1     2015-01-22
> hdfs://quickstart.cloudera:8020/tmp/t/20150122/dudu/cust1/data.txt
>
> 3         dudu      cust1     2015-01-22
> hdfs://quickstart.cloudera:8020/tmp/t/20150122/dudu/cust1/data.txt
>
> 10        raj       cust3     2015-02-04
> hdfs://quickstart.cloudera:8020/tmp/t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz/data.txt
>
>
>
>
>
>
>
> *From:* Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
> *Sent:* Monday, June 06, 2016 6:10 PM
>
> *To:* user@hive.apache.org
> *Subject:* RE: alter partitions on hive external table
>
>
>
> … are just logical connections between certain values and specific
> directories …
>
>
>
> *From:* Markovitz, Dudu [mailto:dmarkovitz@paypal.com
> <dm...@paypal.com>]
> *Sent:* Monday, June 06, 2016 6:07 PM
> *To:* user@hive.apache.org
> *Subject:* RE: alter partitions on hive external table
>
>
>
> Hi Raj
>
>
>
> 1.       I don’t understand the reason for this change, can you please
> elaborate?
>
>
>
> 2.       External table is just an interface. Instructions for how to
> read existing data.
>
> Partitions of external table are just a logical connections between
> certain values and a specific directories.
>
> You can connect any set of values to any directory no matter what the
> directories structure is and then query the external table filtering on
> this values and by that eliminating the query only to the directories you
> are interested in.
>
>
>
> 3.       By all means, don’t duplicate data without a good reason (unless
> you don’t care about wasting storage, time, CPU etc.)
>
>
>
> It seems to me that all you need to do is to retrieve a list of the
> directories and generate “alter table … add partition…” statements based on
> that.
>
>
>
> Dudu
>
>
>
> *From:* raj hive [mailto:raj.hiveql@gmail.com <ra...@gmail.com>]
> *Sent:* Monday, June 06, 2016 6:02 AM
> *To:* user@hive.apache.org
> *Subject:* alter partitions on hive external table
>
>
>
> Hi friends,
>
> I have created partitions on hive external tables. partitions on
> datetime/userid/customerId.
>
> now i have to change the order of the partitions for the existing data for
> all the dates.
>
> order of the partition is custerid/userid/datetime.
>
> Anyone can help me, how to alter the partitions for the existing table.
> Need a help to write a script to change the partions on existing data.
> almost 3 months data is there to modify as per new partition so changing
> each date is difficult. Any expert can help me.
>
> Thanks
>
> Raj
>

RE: alter partitions on hive external table

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
And here is a full example

----------------------------------------------------------------------------------------------------
--             bash
----------------------------------------------------------------------------------------------------

mkdir -p t
mkdir -p t/20150122/dudu/cust1
mkdir -p t/cust2/aaa/2015/01/23/bbb/ccc/dudu/ddd
mkdir -p t/raj/20150204/cust1
mkdir -p t/raj/cust2/yyy/20150204/zzz
mkdir -p t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz

echo -e "1\n2\n3" > t/20150122/dudu/cust1/data.txt
echo -e "4"       > t/cust2/aaa/2015/01/23/bbb/ccc/dudu/ddd/data.txt
echo -e "5\n6"    > t/raj/20150204/cust1/data.txt
echo -e "7\n8\n9" > t/raj/cust2/yyy/20150204/zzz/data.txt
echo -e "10"      > t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz/data.txt

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--             hive
----------------------------------------------------------------------------------------------------


n  We’re creating the external table with the requested partition columns


create external table t (i int) partitioned by (user string,cust string,dt date) location '/tmp/t';


n  We’re choosing each partition values according the full path of the relevant directory

alter table t add partition (user='dudu',cust='cust1',dt=date '2015-01-22') location '/tmp/t/20150122/dudu/cust1';
alter table t add partition (user='dudu',cust='cust2',dt=date '2015-01-23') location '/tmp/t/cust2/aaa/2015/01/23/bbb/ccc/dudu/ddd';
alter table t add partition (user='raj' ,cust='cust1',dt=date '2015-02-04') location '/tmp/t/raj/20150204/cust1';
alter table t add partition (user='raj' ,cust='cust2',dt=date '2015-02-04') location '/tmp/t/raj/cust2/yyy/20150204/zzz';
alter table t add partition (user='raj' ,cust='cust3',dt=date '2015-02-04') location '/tmp/t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz';


n  The partitions’ values and their corresponding locations are all saved in the metastore

n  The metastore is being queried based on our query predicates. Returning the list of relevant partitions/locations


explain dependency select * from t where (cust like '%1' and dt < date '2015-02-01') or (user='raj' and substr(cust,-1) = 3) ;

{"input_partitions":[{"partitionName":"default@t@user=dudu/cust=cust1/dt=2015-01-22"},{"partitionName":"default@t@user=raj/cust=cust3/dt=2015-02-04"}],"input_tables":[{"tablename":"default@t","tabletype":"EXTERNAL_TABLE"}]}

select *,input__file__name from t where (cust like '%1' and dt < date '2015-02-01') or (user='raj' and substr(cust,-1) = 3) ;

1         dudu      cust1     2015-01-22         hdfs://quickstart.cloudera:8020/tmp/t/20150122/dudu/cust1/data.txt
2         dudu      cust1     2015-01-22         hdfs://quickstart.cloudera:8020/tmp/t/20150122/dudu/cust1/data.txt
3         dudu      cust1     2015-01-22         hdfs://quickstart.cloudera:8020/tmp/t/20150122/dudu/cust1/data.txt
10        raj       cust3     2015-02-04          hdfs://quickstart.cloudera:8020/tmp/t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz/data.txt



From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Monday, June 06, 2016 6:10 PM
To: user@hive.apache.org
Subject: RE: alter partitions on hive external table

… are just logical connections between certain values and specific directories …

From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Monday, June 06, 2016 6:07 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: alter partitions on hive external table

Hi Raj


1.       I don’t understand the reason for this change, can you please elaborate?



2.       External table is just an interface. Instructions for how to read existing data.

Partitions of external table are just a logical connections between certain values and a specific directories.

You can connect any set of values to any directory no matter what the directories structure is and then query the external table filtering on this values and by that eliminating the query only to the directories you are interested in.



3.       By all means, don’t duplicate data without a good reason (unless you don’t care about wasting storage, time, CPU etc.)

It seems to me that all you need to do is to retrieve a list of the directories and generate “alter table … add partition…” statements based on that.


Dudu

From: raj hive [mailto:raj.hiveql@gmail.com]
Sent: Monday, June 06, 2016 6:02 AM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: alter partitions on hive external table

Hi friends,
I have created partitions on hive external tables. partitions on datetime/userid/customerId.
now i have to change the order of the partitions for the existing data for all the dates.
order of the partition is custerid/userid/datetime.
Anyone can help me, how to alter the partitions for the existing table. Need a help to write a script to change the partions on existing data. almost 3 months data is there to modify as per new partition so changing each date is difficult. Any expert can help me.
Thanks
Raj

RE: alter partitions on hive external table

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
… are just logical connections between certain values and specific directories …

From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Monday, June 06, 2016 6:07 PM
To: user@hive.apache.org
Subject: RE: alter partitions on hive external table

Hi Raj


1.       I don’t understand the reason for this change, can you please elaborate?



2.       External table is just an interface. Instructions for how to read existing data.

Partitions of external table are just a logical connections between certain values and a specific directories.

You can connect any set of values to any directory no matter what the directories structure is and then query the external table filtering on this values and by that eliminating the query only to the directories you are interested in.



3.       By all means, don’t duplicate data without a good reason (unless you don’t care about wasting storage, time, CPU etc.)

It seems to me that all you need to do is to retrieve a list of the directories and generate “alter table … add partition…” statements based on that.


Dudu

From: raj hive [mailto:raj.hiveql@gmail.com]
Sent: Monday, June 06, 2016 6:02 AM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: alter partitions on hive external table

Hi friends,
I have created partitions on hive external tables. partitions on datetime/userid/customerId.
now i have to change the order of the partitions for the existing data for all the dates.
order of the partition is custerid/userid/datetime.
Anyone can help me, how to alter the partitions for the existing table. Need a help to write a script to change the partions on existing data. almost 3 months data is there to modify as per new partition so changing each date is difficult. Any expert can help me.
Thanks
Raj

RE: alter partitions on hive external table

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
Hi Raj


1.       I don’t understand the reason for this change, can you please elaborate?



2.       External table is just an interface. Instructions for how to read existing data.

Partitions of external table are just a logical connections between certain values and a specific directories.

You can connect any set of values to any directory no matter what the directories structure is and then query the external table filtering on this values and by that eliminating the query only to the directories you are interested in.



3.       By all means, don’t duplicate data without a good reason (unless you don’t care about wasting storage, time, CPU etc.)

It seems to me that all you need to do is to retrieve a list of the directories and generate “alter table … add partition…” statements based on that.


Dudu

From: raj hive [mailto:raj.hiveql@gmail.com]
Sent: Monday, June 06, 2016 6:02 AM
To: user@hive.apache.org
Subject: alter partitions on hive external table

Hi friends,
I have created partitions on hive external tables. partitions on datetime/userid/customerId.
now i have to change the order of the partitions for the existing data for all the dates.
order of the partition is custerid/userid/datetime.
Anyone can help me, how to alter the partitions for the existing table. Need a help to write a script to change the partions on existing data. almost 3 months data is there to modify as per new partition so changing each date is difficult. Any expert can help me.
Thanks
Raj

Re: alter partitions on hive external table

Posted by Margus Roo <ma...@roo.ee>.
Hi

The first idea pops up is:

1. HDFS commands to copy your existing structure and data to support a 
new partitions structure.

2. Create a new on temporary hive external table

3. (optional) if you created temporary table then drop old one and 
insert ... select from temporary table.


Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
+372 51 48 780

On 06/06/16 06:02, raj hive wrote:
> Hi friends,
>
> I have created partitions on hive external tables. partitions on 
> datetime/userid/customerId.
>
> now i have to change the order of the partitions for the existing data 
> for all the dates.
>
> order of the partition is custerid/userid/datetime.
>
> Anyone can help me, how to alter the partitions for the existing 
> table. Need a help to write a script to change the partions on 
> existing data. almost 3 months data is there to modify as per new 
> partition so changing each date is difficult. Any expert can help me.
>
> Thanks
> Raj


Re: alter partitions on hive external table

Posted by Mich Talebzadeh <mi...@gmail.com>.
so you are doing this for partition elimination?

it is a tough call whatever you do

Since userid is unique you can try

CLUSTERED BY (userid,datetime,customerid) INTO 256 BUCKETS

or try creating a new table based on new column partition and insert/select
part of data and see it actually improves performance.

I much doubt whichever way you go it is really going to have that impact on
your performance.

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 June 2016 at 08:18, raj hive <ra...@gmail.com> wrote:

> Hi Mich,
>
> table type is external table. Yes, I am doing this for certain queries
> where userid as the most significant column.
>
> On Mon, Jun 6, 2016 at 12:35 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> That order datetime/userid/customerId looks more natural to me.
>>
>> Two questions:
>>
>> What is the type of table in Hive?
>>
>> Are you doing this for certain queries where you think userid as the most
>> significant column is going to help queries better?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 6 June 2016 at 04:02, raj hive <ra...@gmail.com> wrote:
>>
>>> Hi friends,
>>>
>>> I have created partitions on hive external tables. partitions on
>>> datetime/userid/customerId.
>>>
>>> now i have to change the order of the partitions for the existing data
>>> for all the dates.
>>>
>>> order of the partition is custerid/userid/datetime.
>>>
>>> Anyone can help me, how to alter the partitions for the existing table.
>>> Need a help to write a script to change the partions on existing data.
>>> almost 3 months data is there to modify as per new partition so changing
>>> each date is difficult. Any expert can help me.
>>>
>>> Thanks
>>> Raj
>>>
>>
>>
>

Re: alter partitions on hive external table

Posted by raj hive <ra...@gmail.com>.
Hi Mich,

table type is external table. Yes, I am doing this for certain queries
where userid as the most significant column.

On Mon, Jun 6, 2016 at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> That order datetime/userid/customerId looks more natural to me.
>
> Two questions:
>
> What is the type of table in Hive?
>
> Are you doing this for certain queries where you think userid as the most
> significant column is going to help queries better?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 6 June 2016 at 04:02, raj hive <ra...@gmail.com> wrote:
>
>> Hi friends,
>>
>> I have created partitions on hive external tables. partitions on
>> datetime/userid/customerId.
>>
>> now i have to change the order of the partitions for the existing data
>> for all the dates.
>>
>> order of the partition is custerid/userid/datetime.
>>
>> Anyone can help me, how to alter the partitions for the existing table.
>> Need a help to write a script to change the partions on existing data.
>> almost 3 months data is there to modify as per new partition so changing
>> each date is difficult. Any expert can help me.
>>
>> Thanks
>> Raj
>>
>
>

Re: alter partitions on hive external table

Posted by Mich Talebzadeh <mi...@gmail.com>.
That order datetime/userid/customerId looks more natural to me.

Two questions:

What is the type of table in Hive?

Are you doing this for certain queries where you think userid as the most
significant column is going to help queries better?

HTH



Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 June 2016 at 04:02, raj hive <ra...@gmail.com> wrote:

> Hi friends,
>
> I have created partitions on hive external tables. partitions on
> datetime/userid/customerId.
>
> now i have to change the order of the partitions for the existing data for
> all the dates.
>
> order of the partition is custerid/userid/datetime.
>
> Anyone can help me, how to alter the partitions for the existing table.
> Need a help to write a script to change the partions on existing data.
> almost 3 months data is there to modify as per new partition so changing
> each date is difficult. Any expert can help me.
>
> Thanks
> Raj
>