Posted to dev@hive.apache.org by "Raghunath, Ranjith" <Ra...@usaa.com> on 2011/12/15 03:23:29 UTC

bucketing in hive

Can one use bucketing in Hive to emulate hash partitioning on a database? Is there also a way to segment data into buckets dynamically based on the values in a column? For example,

Col1                       Col2
Apple                    1
Orange                 2
Apple                    2
Banana                 1

If the file above were inserted into a table with Col1 as the bucket column, can we dynamically route all of the rows with “Apple” into one file, “Orange” into another, and so on? Is there a way to do this without specifying the number of buckets to be 3?
Thank you,
Ranjith
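For context on the first question: Hive's bucketing does work like a hash partition, routing each row to bucket hash(col) % numBuckets, so all rows with the same Col1 value land in the same bucket file, though two distinct values can hash into the same file. A minimal sketch of the era's syntax, with the table names `fruits` and `fruits_stage` being hypothetical:

```sql
-- Hypothetical bucketed table; names and column types are illustrative.
CREATE TABLE fruits (col1 STRING, col2 INT)
CLUSTERED BY (col1) INTO 3 BUCKETS;

-- In Hive of this vintage, bucketing is only enforced on INSERT when set:
SET hive.enforce.bucketing = true;

-- Rows are routed to bucket hash(col1) % 3, so both "Apple" rows land in
-- the same bucket file; "Orange" may or may not share that file.
INSERT OVERWRITE TABLE fruits
SELECT col1, col2 FROM fruits_stage;
```

The one-file-per-distinct-value behavior asked about is what partitioning on the column (rather than bucketing) provides.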

Re: bucketing in hive

Posted by "Raghunath, Ranjith" <Ra...@usaa.com>.
Thanks for the help with this.

Thanks,
Ranjith






Re: bucketing in hive

Posted by Bejoy Ks <be...@yahoo.com>.
Ranjith
    You can definitely change the number of buckets in a Hive table even after its creation. You need to issue an ALTER TABLE command that repeats the CLUSTERED BY and/or SORTED BY clauses used by your table. For example, if I have a table whose DDL looks like this:

CREATE EXTERNAL TABLE employee
(
  emp_id STRING, emp_name STRING,
  dept STRING, location STRING
)
CLUSTERED BY (dept, location) SORTED BY (dept, location) INTO 15 BUCKETS;

You can then change the number of buckets with an ALTER TABLE command:

ALTER TABLE employee CLUSTERED BY (dept, location) SORTED BY (dept, location) INTO 20 BUCKETS;


The one major factor you need to consider here is sampling queries on partitioned, bucketed tables: partitions loaded before the ALTER statement keep the old number of buckets, while partitions loaded afterwards get the new number.
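The sampling caveat can be made concrete with TABLESAMPLE, since the bucket numbers in the sampling clause are interpreted against the table's current bucket count. A sketch, assuming a hypothetical partitioned variant `employee_p` clustered on a single column:

```sql
-- Hypothetical: employee_p is bucketed INTO 20 BUCKETS on dept after the
-- ALTER. Partitions loaded before the ALTER were written with 15 buckets,
-- so a bucket sample like this only maps cleanly onto whole bucket files
-- for the newer partitions; older ones must be scanned and filtered.
SELECT * FROM employee_p TABLESAMPLE(BUCKET 1 OUT OF 20 ON dept);
```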

Hope it helps!

Regards
Bejoy.K.S




Re: bucketing in hive

Posted by "Raghunath, Ranjith" <Ra...@usaa.com>.
Thanks Bejoy. Appreciate the insight.

Do you know of a way to alter the number of buckets once a table has been set up?

Thanks,
Ranjith




Re: bucketing in hive

Posted by Bejoy Ks <be...@yahoo.com>.
Hi Ranjith
    I'm not aware of any dynamic bucketing in Hive, whereas dynamic partitions are definitely available. Your partitions/sub-partitions would be generated on the fly based on the value of a particular column, and records with the same value for that column go into the same partition. But a dynamic-partition load can't happen with a LOAD DATA statement, since it requires running a MapReduce job. You can utilize dynamic partitions in two steps for delimited files:
- Load the delimited file into a non-partitioned table in Hive using LOAD DATA.
- Load data into the destination table from that source table using INSERT OVERWRITE; this triggers the MR job that does the work for you.
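The two steps above can be sketched in HiveQL. Table names, column names, and the input path are hypothetical; the nonstrict mode setting is needed because no static partition value is supplied:

```sql
-- Step 1: load the delimited file into a non-partitioned staging table.
CREATE TABLE fruits_stage (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/tmp/fruits.csv' INTO TABLE fruits_stage;

-- Step 2: insert into a table partitioned on the value column. The MR job
-- creates one partition directory per distinct col1 ("Apple", "Orange", ...).
-- The dynamic partition column must come last in the SELECT list.
CREATE TABLE fruits_part (col2 INT) PARTITIONED BY (col1 STRING);
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE fruits_part PARTITION (col1)
SELECT col2, col1 FROM fruits_stage;
```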

I have scribbled something down on the same topic; check whether it would be useful for you:
http://kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html

Regards
Bejoy.K.S



