Posted to notifications@dubbo.apache.org by GitBox <gi...@apache.org> on 2021/04/11 07:42:47 UTC
[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Imp: delete a service provider when using k8s hpa

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


   > Do the other methods that obtain the key have the same problem?
   
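   Based on discussions with the engineers at 开课啦, the whole environment runs in k8s and uses zk as the registry. The failure unfolds as follows:
   
   1 service A (hereafter sA) has one service node, provider M (hereafter pM), on physical host X (hereafter hX);
   2 pM registers with the registry using hX's IP:port rather than the ip:port of its own pod, so that consumers outside the k8s cluster can also call the sA service that pM provides;
   3 a new sA node, provider N (hereafter pN), is started on hX, and pN also registers with the registry using hX's IP:port;
   4 after pN has run stably for a while, pM is taken offline;
   5 when the consumer receives the pM offline event, pM and pN share the same service key in its local cache, so it takes both pM and pN offline.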
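
   To make the collision concrete, here is a minimal sketch of a consumer-side cache keyed only by service name and address. The `Provider` type, the `serviceKey` function, and the address used are hypothetical simplifications for illustration, not dubbo-go's actual types or key format.

```go
package main

import "fmt"

// Provider is a hypothetical, simplified registration record (not a dubbo-go type).
type Provider struct {
	Name      string // "pM" or "pN", for illustration only
	Service   string // service name, e.g. "sA"
	Addr      string // IP:port registered with the registry (the host's, not the pod's)
	Timestamp int64  // registration timestamp carried by the provider
}

// serviceKey builds a key from service name and address only (assumed format).
func serviceKey(p Provider) string {
	return p.Service + "@" + p.Addr
}

// Cache is a consumer-side cache indexed by service key.
type Cache map[string]Provider

// Add stores the provider under its service key, replacing any entry with the same key.
func (c Cache) Add(p Provider) { c[serviceKey(p)] = p }

// Remove deletes whatever is stored under the provider's service key.
func (c Cache) Remove(p Provider) { delete(c, serviceKey(p)) }

func main() {
	pM := Provider{Name: "pM", Service: "sA", Addr: "10.0.0.1:20000", Timestamp: 100}
	pN := Provider{Name: "pN", Service: "sA", Addr: "10.0.0.1:20000", Timestamp: 200}

	cache := Cache{}
	cache.Add(pM)
	cache.Add(pN) // same key as pM, so the cache still holds a single entry

	cache.Remove(pM) // the offline event for pM also wipes pN's entry
	_, stillThere := cache[serviceKey(pN)]
	fmt.Println("same key:", serviceKey(pM) == serviceKey(pN), "pN still cached:", stillThere)
}
```

   Because the key carries nothing that distinguishes pM from pN, the offline event for pM cannot be told apart from an offline event for pN.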
   
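   Looking at this sequence, the root cause is their devops deployment, but they would like the dubbo/dubbogo layer to absorb the problem. The users want to confirm that an offline notification is accurate by checking some reliable field in it (for example, timestamp?).
   
   This relies on one precondition: the registry delivers notification events in order.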
   
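   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, but events may be lost; this can be compensated for by the dubbo/dubbogo health check;
   5 nacos: not sure, I will go and ask.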
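
   As an illustration of why an ordered version number matters, the following sketch drops any event whose revision is not newer than the last one applied for the same key. The `Event` and `OrderedApplier` types are hypothetical; `Revision` merely stands in for a monotonically increasing version such as the etcd revision, and this is not how any dubbo-go registry implementation is written.

```go
package main

import "fmt"

// Event is a hypothetical registry notification (not a dubbo-go type).
type Event struct {
	Key      string // service key the event refers to
	Action   string // "add" or "delete"
	Revision int64  // assumed monotonically increasing version from the registry
}

// OrderedApplier remembers the last revision applied per key.
type OrderedApplier struct {
	last    map[string]int64
	Applied []Event
}

func NewOrderedApplier() *OrderedApplier {
	return &OrderedApplier{last: make(map[string]int64)}
}

// Apply accepts the event only if it is newer than the last one seen for its key.
// It returns false for stale or duplicated events.
func (a *OrderedApplier) Apply(ev Event) bool {
	if prev, ok := a.last[ev.Key]; ok && ev.Revision <= prev {
		return false
	}
	a.last[ev.Key] = ev.Revision
	a.Applied = append(a.Applied, ev)
	return true
}

func main() {
	a := NewOrderedApplier()
	fmt.Println(a.Apply(Event{Key: "sA", Action: "add", Revision: 5}))
	fmt.Println(a.Apply(Event{Key: "sA", Action: "delete", Revision: 3})) // older than revision 5
	fmt.Println(a.Apply(Event{Key: "sA", Action: "delete", Revision: 6}))
}
```

   A guard like this only helps when the registry exposes such a version; it does not recover events that were never delivered, which is the case the health check is meant to cover.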
   
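   Next, based on an analysis of the code, the proposed improvement is:
   1 on receiving an offline event, first check whether the provider behind the service key has been invoked recently (within one heartbeat period). If it has, do not take it offline; whether it finally goes offline depends on the healthCheck result;
   2 if it has not been invoked recently, compare the timestamp field in the registry's offline notification against the provider's, and take the provider offline if they match.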
   ![image](https://user-images.githubusercontent.com/7959374/114296196-85034a00-9adc-11eb-9aff-0ecef00fa7ca.png)
   
   This double safeguard keeps the probability of taking a provider offline by mistake as low as possible.

