Posted to hdfs-dev@hadoop.apache.org by "Erik Krogen (JIRA)" <ji...@apache.org> on 2019/01/16 17:51:00 UTC
[jira] [Created] (HDFS-14211) [Consistent Observer Reads] Allow for configurable "always msync" mode

Erik Krogen created HDFS-14211:
----------------------------------

             Summary: [Consistent Observer Reads] Allow for configurable "always msync" mode
                 Key: HDFS-14211
                 URL: https://issues.apache.org/jira/browse/HDFS-14211
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs-client
            Reporter: Erik Krogen


To allow for reads to be serviced from an ObserverNode (see HDFS-12943) in a consistent way, an {{msync}} API was introduced (HDFS-13688) to allow for a client to fetch the latest transaction ID from the Active NN, thereby ensuring that subsequent reads from the ObserverNode will be up-to-date with the current state of the Active.

Using this properly, however, requires application-side changes. For example, a NodeManager should call {{msync}} before localizing the resources for a client: it learns of those resources through communication that is out-of-band to HDFS, and thus could attempt to localize them before they are available on the ObserverNode.
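
The NodeManager scenario can be illustrated with a toy model. This is a minimal sketch and not HDFS code: the {{Active}}, {{Observer}} and {{Client}} classes and all of their methods are hypothetical stand-ins. The only behaviour taken from the description above is that {{msync}} fetches the Active's latest transaction ID and that a later Observer read must reflect at least that state. Having the Observer catch up immediately inside the read is a simplifying assumption.

```python
class Active:
    """Toy stand-in for the Active NN: owns the edit log and the latest txid."""
    def __init__(self):
        self.txid = 0
        self.edits = []  # ordered (txid, path) pairs

    def create(self, path):
        self.txid += 1
        self.edits.append((self.txid, path))
        return self.txid

    def msync(self):
        # Hypothetical RPC: report the latest transaction ID.
        return self.txid


class Observer:
    """Toy stand-in for an ObserverNode that applies the Active's edits lazily."""
    def __init__(self, active):
        self.active = active
        self.applied_txid = 0
        self.paths = set()

    def tail_edits(self, up_to):
        for txid, path in self.active.edits:
            if self.applied_txid < txid <= up_to:
                self.paths.add(path)
                self.applied_txid = txid

    def exists(self, path, min_txid):
        # Assumption: the Observer catches up to the caller's required
        # state before answering, modelled here as an immediate tail.
        if self.applied_txid < min_txid:
            self.tail_edits(min_txid)
        return path in self.paths


class Client:
    """Toy client that tracks the last transaction ID it has seen."""
    def __init__(self, active, observer):
        self.active, self.observer = active, observer
        self.last_seen_txid = 0

    def create(self, path):
        self.last_seen_txid = self.active.create(path)

    def msync(self):
        self.last_seen_txid = max(self.last_seen_txid, self.active.msync())

    def exists(self, path):
        return self.observer.exists(path, self.last_seen_txid)


active = Active()
observer = Observer(active)
submitter = Client(active, observer)     # the client that uploads resources
node_manager = Client(active, observer)  # learns of the path out-of-band

submitter.create("/job/resource.jar")
stale = node_manager.exists("/job/resource.jar")  # no msync: Observer is behind
node_manager.msync()
fresh = node_manager.exists("/job/resource.jar")  # after msync: read is consistent
print(stale, fresh)
```

In this model the first lookup misses the file because the NodeManager's own last-seen transaction ID gives the Observer no reason to catch up; the second lookup succeeds once {{msync}} has advanced it.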

Until such application-side changes can be made, which will be a longer-term effort, we need to provide a mechanism for unchanged clients to utilize the ObserverNode without exposing such a client to inconsistencies. This is essentially phase 3 of the roadmap outlined in the [design document|https://issues.apache.org/jira/secure/attachment/12915990/ConsistentReadsFromStandbyNode.pdf] for HDFS-12943.

The design document proposes some heuristics based on an understanding of how common applications (e.g. MR) use HDFS for resources. As an initial pass, we can simply have a flag which tells a client to call {{msync}} before _every single_ read operation. This may seem counterintuitive, as it turns every read operation into two RPCs: an {{msync}} to the Active followed by the actual read operation to the Observer. However, the {{msync}} operation is extremely lightweight, as it does not acquire the {{FSNamesystemLock}}, and in experiments we have found that this approach can easily scale to well over 100,000 {{msync}} operations per second on the Active (while still servicing approx. 10,000 write op/s). Combined with the fast-path edit log tailing for standby/observer nodes (HDFS-13150), this "always msync" approach should introduce only a few ms of extra latency to each read call.
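
The "two RPCs per read" cost of such a flag can be sketched as follows. This is an illustrative model only: {{ToyCluster}}, {{ObserverReadClient}} and the {{always_msync}} parameter are hypothetical names, not an existing HDFS API or configuration key, and the Observer is assumed to catch up instantly to whatever state the client requires.

```python
class ToyCluster:
    """Hypothetical model of one Active and one Observer that counts RPCs."""
    def __init__(self):
        self.active_txid = 0
        self.observer_txid = 0
        self.active_rpcs = 0
        self.observer_rpcs = 0

    def write(self):
        self.active_rpcs += 1
        self.active_txid += 1

    def msync(self):
        # Lightweight call to the Active: only reports the latest txid.
        self.active_rpcs += 1
        return self.active_txid

    def observer_read(self, min_txid):
        # Assumption: the Observer catches up to min_txid before answering.
        self.observer_rpcs += 1
        self.observer_txid = max(self.observer_txid, min_txid)
        return self.observer_txid  # the state this read was served from


class ObserverReadClient:
    """Unchanged application code calls read(); the flag adds the msync."""
    def __init__(self, cluster, always_msync=False):
        self.cluster = cluster
        self.always_msync = always_msync  # hypothetical flag name
        self.last_seen_txid = 0

    def read(self):
        if self.always_msync:
            latest = self.cluster.msync()
            self.last_seen_txid = max(self.last_seen_txid, latest)
        return self.cluster.observer_read(self.last_seen_txid)


cluster = ToyCluster()
for _ in range(3):
    cluster.write()  # writes issued by some other client

plain = ObserverReadClient(cluster)
synced = ObserverReadClient(cluster, always_msync=True)

plain_state = plain.read()    # one RPC, may be served from stale state
synced_state = synced.read()  # two RPCs, served from up-to-date state
print(plain_state, synced_state, cluster.active_rpcs, cluster.observer_rpcs)
```

The trade-off is visible in the counters: the flagged client adds one extra call to the Active for each read, in exchange for reads that reflect the Active's current state.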

Below are some results from experiments which convert a normal RPC workload into one in which all read operations are turned into an {{msync}}. The baseline is a workload of 1.5k write op/s and 25k read op/s.

||Rate Multiplier|2|4|6|8||
||RPC Queue Avg Time (ms)|14.2|53.2|110.4|125.3||
||RPC Queue NumOps Avg (k)|51.4|102.3|147.8|177.9||
||RPC Queue NumOps Max (k)|148.8|269.5|306.3|312.4||

Results are promising up to somewhere between 4x and 6x the baseline workload, which is approx. 100-150k read op/s.
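
For reference, the offered load at each rate multiplier can be worked out from the stated baseline. The short sketch below assumes the multiplier scales reads and writes equally; the function and constant names are illustrative.

```python
BASE_READS_K = 25.0   # baseline read op/s, in thousands
BASE_WRITES_K = 1.5   # baseline write op/s, in thousands

def offered_load(multiplier):
    """Return (read kop/s, write kop/s) for a given rate multiplier."""
    return BASE_READS_K * multiplier, BASE_WRITES_K * multiplier

for m in (2, 4, 6, 8):
    reads_k, writes_k = offered_load(m)
    print(f"{m}x: {reads_k:.0f}k read op/s, {writes_k:.1f}k write op/s")
```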



