Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/04/01 02:51:45 UTC

[GitHub] [ozone] guihecheng edited a comment on pull request #3254: HDDS-5327. EC: WritableEcContainerProvider should dynamically adjust the open container groups.

guihecheng edited a comment on pull request #3254:
URL: https://github.com/apache/ozone/pull/3254#issuecomment-1085357251


   Thanks @sodonnel for taking the time to look at this. This PR is just a possible proposal for discussion, and we need more input on some of the problems.
   
   It seems another possible approach has been proposed, and I've noted some doubts below.
   
   > I read through the doc and the change here, and I am not sure tracking space is the correct way to solve this. I know that EC files should be large, but they will not always be. Also a large cluster does not always need a lot of EC pipelines open. 
   
   Well, I agree that we don't always need a lot of pipelines open, especially when the IO load is low, but we do need more of them under heavy loads, and a larger cluster tends to serve heavier loads. Assume EC 3:2: for a cluster of 100 nodes I think we will often need more open pipelines than for a cluster of 5 nodes, so as to utilize more DNs to carry the IO load (I ran a small test on a cluster of 30 DNs during the first discussion under the JIRA; more pipelines benefit performance until we saturate the client bandwidth).
   And we don't have to worry about too many pipelines: the original code already has a configured minimum for open pipelines that bounds the pipeline count, and it is still there, only with a larger value for a larger cluster.
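   
   To make that concrete, here is a minimal sketch of the idea (all names are hypothetical, not code from this PR): keep the configured minimum, but derive a node-based target on top of it.
   
   ```java
   // Hypothetical sketch: the existing configured minimum stays, but the
   // target grows with the number of registered nodes.
   public class EcPipelineTarget {
   
     private final int configuredMinimum; // existing-style minimum, assumed
     private final int pipelinesPerNode;  // overcommit factor, assumed
   
     public EcPipelineTarget(int configuredMinimum, int pipelinesPerNode) {
       this.configuredMinimum = configuredMinimum;
       this.pipelinesPerNode = pipelinesPerNode;
     }
   
     /**
      * @param registeredNodes healthy DNs registered with SCM
      * @param ecGroupSize     data + parity, e.g. 5 for EC 3:2
      */
     public int targetOpenPipelines(int registeredNodes, int ecGroupSize) {
       // A cluster of N nodes can host roughly N / ecGroupSize disjoint
       // pipelines; multiply by a per-node factor for some overcommit.
       int nodeBased = Math.max(1, registeredNodes / ecGroupSize) * pipelinesPerNode;
       // The original minimum still bounds the result from below.
       return Math.max(configuredMinimum, nodeBased);
     }
   }
   ```
   
   With a minimum of 5 and a per-node factor of 2, a 5-node EC 3:2 cluster stays at 5 open pipelines while a 100-node cluster grows to 40, i.e. the "larger value for a larger cluster" above.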
   
   > Sometimes, the main load may be from reads, or there is some rarely used EC policy (eg a tiny number of writes are using EC-3-2). That is why I think the "block allocation rate" is the best way to gauge the number of pipelines we need.
   
   Actually, EC reads are often slower than replicated reads, so EC tends to serve large cold data that is rarely read; hot data is often cached or converted to 2-way/3-way replication. And we tend to use only a few EC policies in a single cluster.
   
   > If we start with some sensible, but configurable minimum and some upper bound based on the registered nodes.
   
   Sure, the upper bound should be related to the registered nodes.
   But what is a "sensible" minimum in your mind then? Any possible contributing factors, like the number of nodes in the cluster, client bandwidth, or something else?
   
   > Then we keep track of the block allocation requests per time period in the ECWritableContainerProvider and per EC policy. We can guess the time it takes for a client to write a full block - it will always be approximate. We don't know how much of the block will be filled, or if the writer is a slow writer streaming events, or a fast writer. We know the max MB they can write, as it will be the `blockSize * Required_Nodes_For_EC_Policy`, for 6-3, that will be `256 * 9 = 2304MB`. The data is written mostly serially, so guess 150MB/s, it will take about 15 seconds to write that block. We scale that number back by some factor as not all blocks will be filled. Eg assume it is 50% of that.
   > 
   
   Here I disagree that we should use an experience-based value like "150MB/s" for the estimation, because it depends largely on the hardware in use, e.g. a 10GE NIC greatly outperforms a 1GE one, and there are even faster cards (25GE, 40GE). And you can imagine that different disks contribute to throughput in a similar way, even though we don't tend to use SSDs for EC.
   Other factors, like client concurrency, other co-existing services sharing the client's resources, and complex network topologies involving switches and racks, all contribute to the IO speed.
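   
   To illustrate, here is a hedged sketch of the quoted estimate with the throughput pulled out as a parameter instead of a fixed 150MB/s (names are hypothetical):
   
   ```java
   // Hypothetical sketch of the estimate quoted above:
   // seconds = blockSizeMB * ecGroupSize * fillFactor / throughputMBs.
   // Throughput is a parameter precisely because it depends on the NIC
   // (1GE vs 10GE/25GE/40GE), disks, client concurrency and topology.
   public final class BlockWriteTimeEstimate {
   
     private BlockWriteTimeEstimate() { }
   
     public static double secondsToFillBlockGroup(
         int blockSizeMb,           // e.g. 256
         int ecGroupSize,           // data + parity, e.g. 9 for EC 6:3
         double throughputMbPerSec, // hardware-dependent, not a constant
         double fillFactor) {       // e.g. 0.5 if blocks are half filled
       double fullGroupMb = (double) blockSizeMb * ecGroupSize; // 2304 for 6:3
       return fullGroupMb * fillFactor / throughputMbPerSec;
     }
   
     public static void main(String[] args) {
       // 150 MB/s reproduces the ~15s guess from the quote (fill factor 1.0),
       System.out.println(secondsToFillBlockGroup(256, 9, 150, 1.0));  // ~15.4
       // but a 10GE client pushing ~1000 MB/s gives a very different answer.
       System.out.println(secondsToFillBlockGroup(256, 9, 1000, 1.0)); // ~2.3
     }
   }
   ```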
   
   > If we are seeing 10 block requests per second for an EC policy, and it takes 15 seconds to write the full block, perhaps we need 10 * 15 = 150 pipelines, or we can scale that by `block_fill_factor`. If the load drops to 1 request per second, we only need 15.
   > 
   > The other thing we need to consider is that Ratis pipelines can have many open containers on a single pipeline, and each container is constrained to a single disk. An EC pipeline only has a single container and hence a single disk on a DN. So we need to consider the number of disks on the DNs as well as the number of nodes.
   
   Here I raised the point in the doc that we need to consider at least disks and network at the same time for a single DN, because the performance of a single DN is largely bounded by these two factors together, but Ozone doesn't collect info on NICs, and most storage systems don't either.
   On the Ozone SCM side, most placement policies only consider nodes and let the DN itself manage its disks.
   So let's start with a configured limit as Ratis does; later we could introduce more reasonable, calculation-based values.
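   
   As a rough illustration of the disk point (hypothetical, not code from the PR), a disk-aware upper bound could look like the sketch below, assuming SCM can learn a disk count per DN from its storage reports:
   
   ```java
   // Hypothetical disk-aware bound: each EC pipeline holds one open
   // container, and a container is pinned to a single disk on each of its
   // DNs, so total disks / group size caps the useful pipeline count.
   public final class DiskAwareBound {
   
     private DiskAwareBound() { }
   
     public static int upperBound(int registeredNodes,
                                  int disksPerNode,  // assumed to be known
                                  int ecGroupSize) { // data + parity
       long totalDisks = (long) registeredNodes * disksPerNode;
       // Each pipeline consumes one disk on each of its ecGroupSize DNs.
       return (int) Math.max(1, totalDisks / ecGroupSize);
     }
   }
   ```
   
   Network capacity has no equivalent input in SCM today, which is why a configured limit, as Ratis uses, remains the practical starting point.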
   
   > I am not sure how fine grained we would need to track the request rate, eg per second, per 10 seconds, per minute. Or should we have something like the Linux top command where it has the 1, 5 and 15 minute average, and if we did have that, how would we use it?
   
   Yeah, exactly, there's hardly any rationale for tracking the request rate as a hint for resource allocation, right?
   Usually we only keep request rates as monitoring metrics, to help us understand the load on the system.
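   
   That said, if we did want the top-style 1, 5 and 15 minute averages from the quote purely as monitoring metrics, a small exponentially weighted moving average would be enough. A minimal sketch (hypothetical names, similar in spirit to the Dropwizard Metrics Meter):
   
   ```java
   // Minimal top-style EWMA rate meter, for monitoring only: mark() on
   // each block-allocation request, tick() once per interval from a timer.
   public class RequestRateMeter {
   
     private final double alpha;        // per-tick smoothing factor
     private final double intervalSecs; // how often tick() runs
     private long uncounted = 0;
     private double ratePerSec = 0.0;
     private boolean initialized = false;
   
     public RequestRateMeter(double windowSecs, double intervalSecs) {
       this.intervalSecs = intervalSecs;
       // Standard EWMA decay; windowSecs = 60, 300, 900 gives the
       // 1, 5 and 15 minute averages mentioned above.
       this.alpha = 1 - Math.exp(-intervalSecs / windowSecs);
     }
   
     public synchronized void mark() {
       uncounted++;
     }
   
     /** Call once per interval, e.g. from a ScheduledExecutorService. */
     public synchronized void tick() {
       double instantRate = uncounted / intervalSecs;
       uncounted = 0;
       if (!initialized) {
         ratePerSec = instantRate;
         initialized = true;
       } else {
         ratePerSec += alpha * (instantRate - ratePerSec);
       }
     }
   
     public synchronized double getRatePerSecond() {
       return ratePerSec;
     }
   }
   ```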
   
   > I feel the existing close logic should handle containers filling OK without having to worry about it in the WriteableContainer provider. The DN triggers the close at some percentage full, expecting more blocks will continue to be written. For EC containers the problem is even less as the blocks are spread across the replicas more than with Ratis.
   
   I agree with this point, and I don't touch the close logic in this PR; the allocatedSpace is only a hint for pre-allocating new pipelines, and we don't force-close the pipeline/container.
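   
   For readers skimming the thread, the shape of that hint (heavily simplified, hypothetical names, not the PR's actual diff) is roughly:
   
   ```java
   // Hypothetical sketch of "allocatedSpace as a hint": when the space
   // already promised to writers nears the aggregate capacity of the open
   // pipelines, ask for another pipeline ahead of time; nothing is closed.
   public class PreAllocationHint {
   
     private final long capacityPerPipeline; // bytes a pipeline can take
     private long allocatedSpace = 0;        // bytes promised so far
     private int openPipelines;
   
     public PreAllocationHint(long capacityPerPipeline, int openPipelines) {
       this.capacityPerPipeline = capacityPerPipeline;
       this.openPipelines = openPipelines;
     }
   
     public synchronized void onBlockAllocated(long blockSize) {
       allocatedSpace += blockSize;
       // Hint only: the 0.8 threshold is an assumption for illustration.
       if (allocatedSpace > 0.8 * capacityPerPipeline * openPipelines) {
         openPipelines++; // stand-in for requesting a new EC pipeline
       }
     }
   
     public synchronized int getOpenPipelines() {
       return openPipelines;
     }
   }
   ```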
   




