Posted to issues@hive.apache.org by "Eugene Koifman (JIRA)" <ji...@apache.org> on 2015/09/16 21:46:45 UTC

[jira] [Updated] (HIVE-11683) Hive Streaming may overload the metastore

     [ https://issues.apache.org/jira/browse/HIVE-11683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-11683:
----------------------------------
    Component/s: Metastore

> Hive Streaming may overload the metastore
> -----------------------------------------
>
>                 Key: HIVE-11683
>                 URL: https://issues.apache.org/jira/browse/HIVE-11683
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Hive, Metastore, Transactions
>    Affects Versions: 1.0.0
>            Reporter: Eugene Koifman
>            Assignee: Roshan Naik
>
> HiveEndPoint represents a way to write to a specific partition transactionally.
> Each HiveEndPoint creates TransactionBatch(es) and commits transactions.
> Suppose you have 10 instances of the Storm Hive bolt using the Streaming API.
> Each instance will create HiveEndPoints on demand when it sees an event for a particular partition value.
> If events are uniformly distributed with respect to partition values and the table has 1000 partitions (for example, it is partitioned by CustomerId), each of the 10 bolt instances may create 1000 HiveEndPoints, for 10,000 endpoints in total and thus more than 10,000 (actually 10K * num_txn_per_batch) concurrent transactions.
> This creates a huge amount of Metastore traffic.
> HIVE-11672 is investigating how some sort of "shuffle" phase can be added to route events for a particular bucket to the same bolt instance.
> The same idea should be explored to route events based on partition value.
> cc [~alangates],[~sriharsha],[~rbains]
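
The arithmetic in the description, and the effect of routing by partition value, can be sketched roughly as follows. This is only an illustration under stated assumptions: it does not use the Storm or Hive Streaming APIs, all function names are hypothetical, and the CRC32-modulo routing is a stand-in for whatever "shuffle" mechanism HIVE-11672 ends up with.

```python
import zlib


def route_to_bolt(partition_value: str, num_bolts: int) -> int:
    """Pick a bolt instance for a partition value (hypothetical routing rule).

    CRC32 is used instead of Python's built-in hash() so that the mapping
    is stable across processes and restarts.
    """
    return zlib.crc32(partition_value.encode("utf-8")) % num_bolts


def endpoints_without_routing(num_partitions: int, num_bolts: int) -> int:
    """Worst case from the description: every bolt sees every partition."""
    return num_partitions * num_bolts


def endpoints_with_routing(partition_values, num_bolts: int) -> int:
    """Count endpoints when each partition value is owned by exactly one bolt."""
    owned = {}  # bolt index -> set of partition values routed to it
    for value in partition_values:
        owned.setdefault(route_to_bolt(value, num_bolts), set()).add(value)
    return sum(len(values) for values in owned.values())


def open_transactions(num_endpoints: int, txns_per_batch: int) -> int:
    """Each endpoint holds one batch of txns_per_batch transactions."""
    return num_endpoints * txns_per_batch


if __name__ == "__main__":
    customers = [str(customer_id) for customer_id in range(1000)]
    print(endpoints_without_routing(len(customers), 10))  # 10 * 1000 = 10000
    print(endpoints_with_routing(customers, 10))          # one per partition = 1000
```

Under these assumptions the number of endpoints drops from partitions * bolts to the number of distinct partition values, and the number of open transactions shrinks by the same factor for any given batch size.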


