You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Ashok Kumar <as...@yahoo.com> on 2015/04/10 18:46:09 UTC

partition and bucket

Greeting all,

Glad to join the user group. I am from DBA background Oracle/Sybase/MSSQL.

I would like to understand partition and bucketing in Hive and the difference between.

Shall be grateful if someone explains where  shall I use partition or bucket for best practices.

thanks

Re: partition and bucket

Posted by Ashok Kumar <as...@yahoo.com>.

Thank you sir. Much appreciated 

     On Sunday, 12 April 2015, 21:05, Mich Talebzadeh <mi...@peridale.co.uk> wrote:

 #yiv0994893552 #yiv0994893552 -- _filtered #yiv0994893552 {panose-1:2 4 5 3 5 4 6 3 2 4;} _filtered #yiv0994893552 {font-family:Calibri;panose-1:2 15 5 2 2 2 4 3 2 4;} _filtered #yiv0994893552 {font-family:Tahoma;panose-1:2 11 6 4 3 5 4 4 2 4;}#yiv0994893552 #yiv0994893552 p.yiv0994893552MsoNormal, #yiv0994893552 li.yiv0994893552MsoNormal, #yiv0994893552 div.yiv0994893552MsoNormal {margin:0cm;margin-bottom:.0001pt;font-size:12.0pt;}#yiv0994893552 a:link, #yiv0994893552 span.yiv0994893552MsoHyperlink {color:blue;text-decoration:underline;}#yiv0994893552 a:visited, #yiv0994893552 span.yiv0994893552MsoHyperlinkFollowed {color:purple;text-decoration:underline;}#yiv0994893552 p.yiv0994893552MsoListParagraph, #yiv0994893552 li.yiv0994893552MsoListParagraph, #yiv0994893552 div.yiv0994893552MsoListParagraph {margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:36.0pt;margin-bottom:.0001pt;font-size:12.0pt;}#yiv0994893552 span.yiv0994893552EmailStyle17 {color:windowtext;}#yiv0994893552 span.yiv0994893552apple-converted-space {}#yiv0994893552 .yiv0994893552MsoChpDefault {} _filtered #yiv0994893552 {margin:72.0pt 72.0pt 72.0pt 72.0pt;}#yiv0994893552 div.yiv0994893552WordSection1 {}#yiv0994893552 _filtered #yiv0994893552 {} _filtered #yiv0994893552 {}#yiv0994893552 ol {margin-bottom:0cm;}#yiv0994893552 ul {margin-bottom:0cm;}#yiv0994893552 Hi,  I will try to have a go at your points but I am sure there are many experts around.  As you may know already in RDBMS partitioning (dividing a very large table into sub-tables conceptually) is deployed to address three areast.   1.     Availability -- each partition can reside on a different tablespace/device. Hence a problem with a tablespace/device will take out a slice of the table's data instead of the whole thing. This does not really ap[ply to Hive with 3 block replication as standard2.     Manageability -- partitioning provides a mechanism for splitting whole table jobs into clear batches. Partition exchange can make it easier to bulk load data. Defragging, moving older partitions to lower tier storage, updating stats etc Most of these benefits apply to Hive as well. Please check the docs. 3.     Performance -- partition elimination   In simplest form (excluding composite partitioning), Hive partitioning will be similar to “range partitioning” in RDBMS. One can partition a table (say partitioned_table as shown below which is batch loaded from non_partitioned_table) -- by country, year, month etc. Each partition will be stored in Hive under sub-directory table/year/month like below  /user/hive/warehouse/scratchpad.db/partitioned_table/country=Italy/year=2014/month=Feb  Hive does not have the concept of indexes local or global as yet. So without partitioning a simple query in Hive will have to read the entire table even if it is filtering a smaller result set (WHERE CLAUSE). This becomes a bottleneck for running multiple MapReduce jobs over a large table. So partitioning will help localise the query by hitting the relevant sub-directory or sub-directories only. There is another important aspect with Hive as well. The locking granularity will be determined by the lowest slice in the filing system (sub-directory). So entering data into the above partition/file, will take an exclusive lock on that partition/file but crucially the rest of partitions will be available (assuming concurrency in Hive is enabled).   +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+|  lockid  |  database   |         table          |             partition              | lock_state  |  lock_type   | transaction_id  | last_heartbeat  |  acquired_at   |  user   | hostname  |+----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+| Lock ID  | Database    | Table                  | Partition                          | State       | Type         | Transaction ID  | Last Hearbeat   | Acquired At    | User    | Hostname  || 1711     | scratchpad  | non_partitioned_table  | NULL                               | ACQUIRED    | SHARED_READ  | NULL            | 1428862154670   | 1428862151904  | hduser  | rhes564   || 1711     | scratchpad  | partitioned_table      | country=Italy/year=2014/month=Feb  | ACQUIRED    | EXCLUSIVE    | NULL            | 1428862154670   | 1428862151905  | hduser  | rhes564   |+----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+  Now your point 2, bucketing in Hive refers to hash partitioning where a hashing function is applied. Likewise an RDBMS, Hive will apply a linear hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing has very high selectivity like an ID column where selectivity (select count(distinct(column))/count(column) ) = 1.  In this case, the created partitions/ files will be as evenly sized as possible. In a nutshell bucketing is a method to get data evenly distributed over many partitions/files.  One should define the number of buckets by a power of two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing will help concurrency in Hive. It may even allow a partition wise join i.e. a join between two tables that are bucketed on the same column with the same number of buckets (anyone has tried this?)  One more things. When one defines the number of buckets at table creation level in Hive, the number of partitions/files will be fixed. In contrast, with partitioning you do not have this limitation.   HTH  Mich    NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.  From: Ashok Kumar [mailto:ashok34668@yahoo.com] 
Sent: 10 April 2015 17:46
To: user@hive.apache.org
Subject: partition and bucket  
| Greeting all,

Glad to join the user group. I am from DBA background Oracle/Sybase/MSSQL.

I would like to understand partition and bucketing in Hive and the difference between.

Shall be grateful if someone explains where shall I use partition or bucket for best practices.

thanks |

Re: partition and bucket

Posted by Devopam Mittra <de...@gmail.com>.

+1
quite well explained. liked it much

regards
Dev

On Mon, Apr 13, 2015 at 1:34 AM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:

> Hi,
>
>
>
> I will try to have a go at your points but I am sure there are many
> experts around.
>
>
>
> As you may know already in RDBMS partitioning (dividing a very large table
> into sub-tables conceptually) is deployed to address three areast.
>
>
>
> 1.     Availability -- each partition can reside on a different
> tablespace/device. Hence a problem with a tablespace/device will take out a
> slice of the table's data instead of the whole thing. This does not really
> ap[ply to Hive with 3 block replication as standard
>
> 2.     Manageability -- partitioning provides a mechanism for splitting
> whole table jobs into clear batches. Partition exchange can make it easier
> to bulk load data. Defragging, moving older partitions to lower tier
> storage, updating stats etc Most of these benefits apply to Hive as well.
> Please check the docs.
>
> 3.     Performance -- partition elimination
>
>
>
> In simplest form (excluding composite partitioning), Hive partitioning
> will be similar to “range partitioning” in RDBMS. One can partition a table
> (say *partitioned_table* as shown below which is batch loaded from
> *non_partitioned_table*) -- by country, year, month etc. Each partition
> will be stored in Hive under sub-directory *table/year/month* like below
>
>
>
> /user/hive/warehouse/scratchpad.db
> */partitioned_table/country=Italy/year=2014/month=Feb*
>
>
>
> Hive does not have the concept of indexes local or global as yet. So
> without partitioning a simple query in Hive will have to read the entire
> table even if it is filtering a smaller result set (WHERE CLAUSE). This
> becomes a bottleneck for running multiple MapReduce jobs over a large table. So
> partitioning will help localise the query by hitting the relevant
> sub-directory or sub-directories only. There is another important aspect
> with Hive as well. The locking granularity will be determined by the lowest
> slice in the filing system (sub-directory). So entering data into the above
> partition/file, will take an exclusive lock on that partition/file but
> crucially the rest of partitions will be available (assuming concurrency in
> Hive is enabled).
>
>
>
>
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+
>
> |  lockid  |  database   |         table          |
> partition              | lock_state  |  lock_type   | transaction_id  |
> last_heartbeat  |  acquired_at   |  user   | hostname  |
>
>
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+
>
> | Lock ID  | Database    | Table                  |
> Partition                          | State       | Type         |
> Transaction ID  | Last Hearbeat   | Acquired At    | User    | Hostname  |
>
> | 1711     | scratchpad  | non_partitioned_table  |
> NULL                               | ACQUIRED    | *SHARED_READ*  |
> NULL            | 1428862154670   | 1428862151904  | hduser  | rhes564   |
>
> | 1711     | scratchpad  | *partitioned_table      |
> country=Italy/year=2014/month=Feb*  | ACQUIRED    | *EXCLUSIVE *   |
> NULL            | 1428862154670   | 1428862151905  | hduser  | rhes564   |
>
>
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+
>
>
>
> Now your point 2, bucketing in Hive refers to hash partitioning where a
> hashing function is applied. Likewise an RDBMS, Hive will apply a linear
> hashing algorithm to prevent data from clustering within specific
> partitions. Hashing is very effective if the column selected for bucketing
> has very high selectivity like an ID column where selectivity (*select
> count(distinct(column))/count(column)* ) = 1.  In this case, the created
> partitions/ files will be as evenly sized as possible. In a nutshell
> bucketing is a method to get data evenly distributed over many
> partitions/files.  One should define the number of buckets by a power of
> two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing
> will help concurrency in Hive. It may even allow a *partition wise join*
> i.e. a join between two tables that are bucketed on the same column with
> the same number of buckets (anyone has tried this?)
>
>
>
> One more things. When one defines the number of buckets at table creation
> level in Hive, the number of partitions/files will be fixed. In contrast,
> with partitioning you do not have this limitation.
>
>
>
> HTH
>
>
>
> Mich
>
>
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Ltd, its
> subsidiaries or their employees, unless expressly so stated. It is the
> responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept
> any responsibility.
>
>
>
> *From:* Ashok Kumar [mailto:ashok34668@yahoo.com]
> *Sent:* 10 April 2015 17:46
> *To:* user@hive.apache.org
> *Subject:* partition and bucket
>
>
>
> Greeting all,
>
> Glad to join the user group. I am from DBA background Oracle/Sybase/MSSQL.
>
> I would like to understand partition and bucketing in Hive and the
> difference between.
>
> Shall be grateful if someone explains where shall I use partition or
> bucket for best practices.
>
> thanks
>
>
>



-- 
Devopam Mittra
Life and Relations are not binary

RE: partition and bucket

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.

Hi,

 

I will try to have a go at your points but I am sure there are many experts around.

 

As you may know already in RDBMS partitioning (dividing a very large table into sub-tables conceptually) is deployed to address three areast. 

 

1.     Availability -- each partition can reside on a different tablespace/device. Hence a problem with a tablespace/device will take out a slice of the table's data instead of the whole thing. This does not really ap[ply to Hive with 3 block replication as standard

2.     Manageability -- partitioning provides a mechanism for splitting whole table jobs into clear batches. Partition exchange can make it easier to bulk load data. Defragging, moving older partitions to lower tier storage, updating stats etc Most of these benefits apply to Hive as well. Please check the docs. 

3.     Performance -- partition elimination 

 

In simplest form (excluding composite partitioning), Hive partitioning will be similar to “range partitioning” in RDBMS. One can partition a table (say partitioned_table as shown below which is batch loaded from non_partitioned_table) -- by country, year, month etc. Each partition will be stored in Hive under sub-directory table/year/month like below

 

/user/hive/warehouse/scratchpad.db/partitioned_table/country=Italy/year=2014/month=Feb

 

Hive does not have the concept of indexes local or global as yet. So without partitioning a simple query in Hive will have to read the entire table even if it is filtering a smaller result set (WHERE CLAUSE). This becomes a bottleneck for running multiple MapReduce jobs over a large table. So partitioning will help localise the query by hitting the relevant sub-directory or sub-directories only. There is another important aspect with Hive as well. The locking granularity will be determined by the lowest slice in the filing system (sub-directory). So entering data into the above partition/file, will take an exclusive lock on that partition/file but crucially the rest of partitions will be available (assuming concurrency in Hive is enabled). 

 

+----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+

|  lockid  |  database   |         table          |             partition              | lock_state  |  lock_type   | transaction_id  | last_heartbeat  |  acquired_at   |  user   | hostname  |

+----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+

| Lock ID  | Database    | Table                  | Partition                          | State       | Type         | Transaction ID  | Last Hearbeat   | Acquired At    | User    | Hostname  |

| 1711     | scratchpad  | non_partitioned_table  | NULL                               | ACQUIRED    | SHARED_READ  | NULL            | 1428862154670   | 1428862151904  | hduser  | rhes564   |

| 1711     | scratchpad  | partitioned_table      | country=Italy/year=2014/month=Feb  | ACQUIRED    | EXCLUSIVE    | NULL            | 1428862154670   | 1428862151905  | hduser  | rhes564   |

+----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+--+

 

Now your point 2, bucketing in Hive refers to hash partitioning where a hashing function is applied. Likewise an RDBMS, Hive will apply a linear hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing has very high selectivity like an ID column where selectivity (select count(distinct(column))/count(column) ) = 1.  In this case, the created partitions/ files will be as evenly sized as possible. In a nutshell bucketing is a method to get data evenly distributed over many partitions/files.  One should define the number of buckets by a power of two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing will help concurrency in Hive. It may even allow a partition wise join i.e. a join between two tables that are bucketed on the same column with the same number of buckets (anyone has tried this?)

 

One more things. When one defines the number of buckets at table creation level in Hive, the number of partitions/files will be fixed. In contrast, with partitioning you do not have this limitation. 

 

HTH

 

Mich

 

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Ashok Kumar [mailto:ashok34668@yahoo.com] 
Sent: 10 April 2015 17:46
To: user@hive.apache.org
Subject: partition and bucket

 


Greeting all,

Glad to join the user group. I am from DBA background Oracle/Sybase/MSSQL.

I would like to understand partition and bucketing in Hive and the difference between.

Shall be grateful if someone explains where shall I use partition or bucket for best practices.

thanks