Posted to user@hive.apache.org by Igor Kuzmenko <f1...@gmail.com> on 2016/04/08 11:35:15 UTC

Hive Hcatalog Streaming. Why hive table must be bucketed?

Hello, I've got a few questions about Hive HCatalog streaming
<https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest>.
This feature has the following requirement:
"The Hive table must be bucketed
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables>,
but not sorted. So something like “clustered by (colName) into 10 buckets”
must be specified during table creation. The number of buckets is ideally
the same as the number of streaming writers."

1) Why is bucketing a required condition for streaming?
2) How many buckets should I create when the number of streaming writers
changes over time (for example, from 1 to 10)?

Re: Hive Hcatalog Streaming. Why hive table must be bucketed?

Posted by Eugene Koifman <ek...@hortonworks.com>.
HCatalog streaming works with Hive's transactional tables, which are currently required to be bucketed.
The latter is to improve read performance, since these tables also support update/delete operations (though not via streaming).

"The number of buckets is ideally ..." is obsolete (as of HIVE-11983). There isn't really a relationship between the number of buckets and the number of streaming writers. Each HiveEndPoint will write as many files as you have buckets.

Eugene
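
To illustrate the point about buckets and writers, here is a minimal Python sketch. It is a toy model, not the Hive streaming API. It assumes an INT bucketing column that hashes to its own value and a hash-mod bucket assignment; the names bucket_for and files_per_writer are hypothetical and exist only for this example.

```python
from collections import defaultdict

INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE


def bucket_for(key: int, num_buckets: int) -> int:
    """Map an integer bucketing key to a bucket id, hash-mod style."""
    # Assumption: an INT column hashes to its own value, and the sign bit
    # is masked off before taking the remainder.
    return (key & INT_MAX) % num_buckets


def files_per_writer(rows_by_writer, num_buckets):
    """For each writer, return the set of bucket ids it would write to."""
    touched = defaultdict(set)
    for writer, keys in rows_by_writer.items():
        for key in keys:
            touched[writer].add(bucket_for(key, num_buckets))
    return dict(touched)


if __name__ == "__main__":
    num_buckets = 4
    # One writer, then ten writers, against the same 4-bucket table.
    one_writer = {"w0": range(100)}
    ten_writers = {f"w{i}": range(i * 100, (i + 1) * 100) for i in range(10)}
    print(files_per_writer(one_writer, num_buckets))
    print(files_per_writer(ten_writers, num_buckets))
```

In this model every writer ends up touching every bucket whether there is one writer or ten, so the bucket count does not need to follow the writer count.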


