Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/05/15 17:54:04 UTC
[jira] [Commented] (DRILL-5510) Revisit connection failure recovery in Hive storage plugin

    [ https://issues.apache.org/jira/browse/DRILL-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010989#comment-16010989 ] 

Paul Rogers commented on DRILL-5510:
------------------------------------

More details. The Hive client in the Hive storage plugin is not designed to handle security.

* When we start the Hive storage plugin, we create a single instance of the {{HiveSchemaFactory}}.
* {{HiveSchemaFactory}} holds on to a {{DrillHiveMetaStoreClient}} connection. In the secure case, this connection is used to get security certificates for use in creating secure connections.
* {{HiveSchemaFactory}} has a Guava loading cache of user-specific, secure connections.

When the Hive metastore goes down, all connections become invalid: both the non-secure connection and all the secure connections. The current code tries to handle the problem as follows.

If a secure connection times out:

* Use the (now-invalid) insecure connection to get another ticket. But, since that connection is no longer valid, we can't reconnect and so always fail.

If we try to use a cached secure connection before timeout, then this happens:

* Try to send a message.
* When that fails, try to reconnect (using the old certificate from the prior session).
* When that fails, give up.

What we really need to do is:

* Recreate both the insecure *and* secure connections.

But, since the secure connection cache is held on the insecure connection, we can't easily recreate that connection: we'd get a new object.

So, we have to make some changes.

* Hold the secure connection cache on an object other than a connection.
* Use a connection proxy instead of the connection as key to the cache. The proxy allows maintaining the cache entry, but replacing the secure connection with a new one. (The proxy is just a wrapper around a replaceable secure connection.)
* Similarly, provide a thread-safe way to reconnect the non-secure connection used to get tickets for the secure connection.
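
The changes listed above could be sketched roughly as follows. This is only an illustration, not the Drill code: {{Connection}}, {{ConnectionProxy}}, {{ProxyCache}} and the {{connector}} function are hypothetical names, the cache is a plain {{ConcurrentHashMap}} rather than the Guava loading cache, and the sketch assumes the proxy is stored as the cache entry under the user name.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class ConnectionProxySketch {

  /** Hypothetical stand-in for a metastore client connection. */
  interface Connection {
    String fetch(String request);
  }

  /** Wrapper around a replaceable connection. The cache holds the proxy, not the connection. */
  static final class ConnectionProxy {
    private final String user;
    private final Function<String, Connection> connector;
    private Connection current;

    ConnectionProxy(String user, Function<String, Connection> connector) {
      this.user = user;
      this.connector = connector;
      this.current = connector.apply(user);
    }

    /** Sends a request; on failure, swaps in a fresh connection and retries once. */
    synchronized String fetch(String request) {
      try {
        return current.fetch(request);
      } catch (RuntimeException e) {
        // The old connection is assumed dead: open a new one instead of reusing it.
        current = connector.apply(user);
        return current.fetch(request);
      }
    }
  }

  /** The cache lives on its own object, so no connection has to outlive a failure. */
  static final class ProxyCache {
    private final ConcurrentMap<String, ConnectionProxy> proxies = new ConcurrentHashMap<>();
    private final Function<String, Connection> connector;

    ProxyCache(Function<String, Connection> connector) {
      this.connector = connector;
    }

    /** Returns the same proxy for a user every time, even after a reconnect. */
    ConnectionProxy forUser(String user) {
      return proxies.computeIfAbsent(user, u -> new ConnectionProxy(u, connector));
    }
  }
}
```

The cache entry for a user stays the same object across a reconnect; only the connection inside the proxy changes. The same kind of wrapper could presumably hold the non-secure connection, with the {{synchronized}} method serving as the thread-safe reconnect.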

All this is not a huge project, but it is more than can be done in the context of a simple bug fix for DRILL-5496. So, for that ticket, I used a hack: just throw away the entire schema builder and create a new one. But, that solution requires synchronizing all requests and is far from ideal. This ticket is a request to create a better long-term solution.
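
For comparison, the general shape of the "throw it all away" workaround might look like the sketch below. This is not the DRILL-5496 patch itself: {{RebuildOnFailureSketch}}, the {{builder}} supplier and the {{call}} method are hypothetical, and the type parameter {{F}} stands in for whatever object holds the connections and the cache.

```java
import java.util.function.Function;
import java.util.function.Supplier;

final class RebuildOnFailureSketch<F> {
  private final Supplier<F> builder;
  private F factory;

  RebuildOnFailureSketch(Supplier<F> builder) {
    this.builder = builder;
    this.factory = builder.get();
  }

  /**
   * Runs a request against the current factory. The method is synchronized,
   * so every request serializes on one lock, which is the cost noted above.
   */
  synchronized <R> R call(Function<F, R> request) {
    try {
      return request.apply(factory);
    } catch (RuntimeException e) {
      // Discard the factory and everything it cached, then retry once.
      factory = builder.get();
      return request.apply(factory);
    }
  }
}
```

Here a failure costs the whole cache, not just the broken connection, and all callers wait on a single lock, which is why a proxy-based design looks preferable in the long run.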

> Revisit connection failure recovery in Hive storage plugin
> ----------------------------------------------------------
>
>                 Key: DRILL-5510
>                 URL: https://issues.apache.org/jira/browse/DRILL-5510
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>
> DRILL-5496 describes a problem which occurs when the Hive metastore server is restarted while Drill runs. The solution in that ticket is a work-around: we discard all cached Hive metastore data and rebuild the metadata cache.
> The original code tried to be more subtle: detect that the connection has failed, reconnect, but preserve the cache. DRILL-5496 describes the flaws in that approach for the secure connection case.
> This ticket asks to spend the time to understand the Hive metadata code and restructure it to preserve the cache across connection failures.
> Note a subtle issue: if the Hive metastore goes down, it may contain different data when it comes back up. Anything could happen while the server is down: schemas could be upgraded, one schema could be replaced with another, etc. So, the caching mechanism, if it is to preserve data across reconnects, must handle such changes.
> Of course, such changes could occur even within a single connection, so the code should handle such cases already.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)