Posted to issues@livy.apache.org by "Bikas Saha (Jira)" <ji...@apache.org> on 2019/12/29 09:55:00 UTC

[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

    [ https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004728#comment-17004728 ] 

Bikas Saha commented on LIVY-718:
---------------------------------

Thank you for taking this up and providing the detailed design document.

After reviewing the document I have a few observations.

 

There are 2 aspects to active-active HA:
 # Higher service availability - The service continues to be available with little to no downtime visible to clients. At a minimum this means no loss of operational data that would negatively affect the correctness or success of client operations. Beyond that, the service can try to make the disruption as transparent and low-impact as possible for clients.
 # Higher service capacity - Multiple active servers can serve more clients/operations. This can be achieved even without high service availability (point 1 above).

 

In order to achieve the above, several items need to be covered:
 # High availability of service state data - This enables the work of a crashed service instance to be taken over by a different (passive or active) service instance. This can be done by storage in a reliable store (HDFS-like) or by HA state maintained within the service instances (Paxos-like).
 ## In Livy's case the actual service data is stored in the Spark process running the Livy session. Hence the minimum (and perhaps only) data that needs to be reliably stored is the Spark driver information (and maybe some other metadata like ACLs, ownership, etc.). Any other data can be retrieved from the running Spark driver and in fact should not be duplicated, since there should be only one source of truth.
 ## If the above is correct then the reliable data probably has a low update rate: an initial record about the session and one update for the Spark driver information. This could be provided by many different systems (even slow ones like HDFS).
 ## To provide a choice of storage, we should consider making it pluggable such that all store operations go through a plugin interface and plugins are loaded at runtime based on configuration (class and jar). Plugins could be in-memory (current), any HDFS-compatible FS, ZooKeeper, etc. IIRC pluggable reliable storage is already available and we just need to check whether the data stored is sufficient and perhaps trim extra data. (A rough sketch of what such a plugin interface could look like follows after this list.)
 # Service availability for a given object - This means a client can continue to access/operate on a given object despite service instance failure. At a minimum this involves some other service instance being able to load that object's metadata and provide operations on it. With active-passive this would mean a different single server becomes the designated one; with active-active any other available server can provide the operations.
 ## In the latter case any server could service the request, or a new designated server is chosen. The design document chooses a designated-server approach (potentially via consistent hashing). I am curious why. Is it because there are issues with multiple servers handling multiple clients to the same session? If yes, what are those issues? Could these issues be handled instead such that any/multiple servers could service a session? The reason I ask is because:
 ### A designated server is very similar to active-passive systems and as such brings its own complications, like ensuring a truly single active instance (e.g. when a previously active server comes back or was not really crashed).
 ### With consistent hashing or any other scheme we have to take care to ensure load balancing of objects across servers so that we do not create hot servers (handling a majority of the objects) by design. This could be the case with integer session ids and hashing.
 ### If we do need a designated server to be chosen then the document needs to clarify how all the cases around leader election, old-leader lockout, and failures will be handled. Otherwise, I am worried that we might find many issues during deployment. How will load balancing be handled?
 ## Session id - I would strongly suggest deprecating the integer session id. IIRC we may not have built any important features that depend on a monotonically increasing number, so there may not be much value in continuing to support it (for the service or the client). It adds an avoidable coordination issue among multiple active servers, and any bug there would have bad consequences with duplicated session ids - which can lead to security issues where one user reads data from a session they are not entitled to. Moving to a UUID scheme removes the potential for any kind of overloading of this id and also removes the need for coordination. It might also help with better load balancing if we have to use a designated server per object. (A small sketch of this follows after this list.)
 # Service discovery - How does a client discover a server to operate on a given object? How does it discover a new server when the previous server fails? Mostly this is not under the service's control since the service does not control DNS, load balancers, etc. A load balancer could provide name transparency to the client, but if the backend gets unbalanced due to hot servers then it won't help. Clients that are not supported by dynamic DNS/IP re-mapping or load balancing would necessarily be affected by the loss of their current server and would have to find out who their new server is. That discovery problem is worse when an object can be serviced only by a single designated server. So here too, if any active server can serve a request then it helps the clients.
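
To make item (1) above concrete, here is a rough sketch, in Scala, of what a pluggable session store and the minimal per-session record could look like. All names here are illustrative and do not necessarily match Livy's existing state-store classes; the only point is that the durable data is small and low-churn, so even a slow HDFS-like backend would be adequate.

{code:scala}
// Illustrative sketch only - the names below do not necessarily match Livy's
// existing recovery/state-store classes. The durable data per session is just
// ownership plus the Spark driver endpoint, written once at creation and once
// when the driver registers, so even a slow HDFS-like backend is adequate.
import java.net.URI
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// The minimal per-session record that must survive a server crash.
case class SessionMetadata(
    sessionId: String,          // UUID string (see the session id note above)
    owner: String,              // for ACL / ownership checks
    appId: Option[String],      // YARN/K8s application id, once known
    driverRpcUri: Option[URI])  // Spark driver endpoint, once the driver reports in

// All store operations go through this interface; the concrete plugin is
// loaded at runtime from configuration (class name and jar).
trait SessionStore {
  def save(meta: SessionMetadata): Unit
  def get(sessionId: String): Option[SessionMetadata]
  def list(): Seq[SessionMetadata]
  def remove(sessionId: String): Unit
}

// In-memory plugin (roughly today's non-recoverable behaviour). Other plugins
// could target any HDFS-compatible FS (one small JSON file per session),
// ZooKeeper, a relational DB, etc.
class InMemorySessionStore extends SessionStore {
  private val sessions = new ConcurrentHashMap[String, SessionMetadata]()
  override def save(meta: SessionMetadata): Unit = sessions.put(meta.sessionId, meta)
  override def get(sessionId: String): Option[SessionMetadata] = Option(sessions.get(sessionId))
  override def list(): Seq[SessionMetadata] = sessions.values().asScala.toSeq
  override def remove(sessionId: String): Unit = sessions.remove(sessionId)
}
{code}

Whether Livy's current state store already persists exactly this set of fields is something we should verify, per the IIRC note above.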
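
Similarly, to make the session id and load balancing points concrete, here is a small illustrative sketch of consistent hashing over UUID session ids. The class and method names are mine and not a proposal for the design; it only shows that UUIDs can be minted locally without cross-server coordination, and that a well-mixed hash plus virtual nodes keeps placement roughly even.

{code:scala}
// Illustrative sketch only, to make the discussion concrete; not a proposal of
// actual class names. UUID session ids are minted locally by whichever server
// receives the create request, so there is no shared counter to coordinate and
// no duplicate-id risk. Virtual nodes plus a well-mixed hash keep placement
// roughly even. This does NOT address old-leader lockout or failover
// correctness - those still need to be spelled out in the design.
import java.security.MessageDigest
import java.util.UUID
import scala.collection.immutable.TreeMap

class ConsistentHashRing(servers: Seq[String], virtualNodes: Int = 128) {
  // Hash a key to a Long using MD5 (any well-mixed hash works).
  private def hash(key: String): Long = {
    val digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"))
    digest.take(8).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))
  }

  // Each server appears `virtualNodes` times on the ring to smooth the load.
  private val ring: TreeMap[Long, String] = TreeMap(
    (for {
      server <- servers
      i <- 0 until virtualNodes
    } yield hash(s"$server#$i") -> server): _*)

  // The designated server for a session is the first ring entry at or after
  // the hash of its id, wrapping around to the start of the ring.
  def serverFor(sessionId: String): String = {
    val it = ring.iteratorFrom(hash(sessionId))
    if (it.hasNext) it.next()._2 else ring.head._2
  }
}

object RingDemo extends App {
  val ring = new ConsistentHashRing(Seq("livy-1:8998", "livy-2:8998", "livy-3:8998"))
  val sessionIds = Seq.fill(10000)(UUID.randomUUID().toString)
  val load = sessionIds.groupBy(ring.serverFor).map { case (s, ids) => s -> ids.size }
  println(load) // roughly an even third per server
}
{code}

If a designated-server-per-session approach is kept, something like this would still need to be paired with the leader election and lockout story asked about above; if any active server can serve any session, the ring is not needed at all.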

In addition to the above, I am generally worried about adding a dependency on ZK for multiple parts of the problem. One of the good things about Livy is that it is operationally low-overhead to set up and operate. That will change drastically, as ZK is a pain to operationalize for most users. So if we can avoid the dependency on ZK then our Livy HA story can be a lot easier and more appealing to our users. From what I see in the document and my notes above, I am not sure there are any issues here that necessitate the high cost of introducing ZK.

 

/cc some of the Livy veterans to confirm/deny my observations as I have not been directly involved for a while. [~jerryshao]  [~zjffdu] [~ajbozarth]  [~mgaido]

 

> Support multi-active high availability in Livy
> ----------------------------------------------
>
>                 Key: LIVY-718
>                 URL: https://issues.apache.org/jira/browse/LIVY-718
>             Project: Livy
>          Issue Type: Epic
>          Components: RSC, Server
>            Reporter: Yiheng Wang
>            Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in some production environments. In our scenario, the Livy server serves many notebook and JDBC services. We want to make the Livy service more fault-tolerant and scalable.
> There're already some proposals in the community for high availability, but they're either incomplete or cover only active-standby high availability. So we propose a multi-active high availability design to achieve the following goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active servers.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)