Posted to notifications@dubbo.apache.org by GitBox <gi...@apache.org> on 2021/04/10 16:43:02 UTC

[GitHub] [dubbo-go] zhaoyunxing92 opened a new issue #1141: Using the node IP in k8s can easily take a service offline so that clients cannot find it

zhaoyunxing92 opened a new issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141


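   If a k8s service registers with the node's IP, the offline event fired during a rolling upgrade takes the newly registered service offline, so clients cannot find the service. The code path is as follows: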
   > registry/directory/directory.go:124
   ```go
   // refreshInvokers refreshes service's events.
   func (dir *RegistryDirectory) refreshInvokers(event *registry.ServiceEvent) {
   	// ...
   	if event != nil {
   		// cache the invoker according to the event
   		oldInvoker, _ = dir.cacheInvokerByEvent(event)
   	}
   	// ...
   }
   ```
   
   > registry/directory/directory.go:240
   ```go
   // cacheInvokerByEvent caches invokers from the service event
   func (dir *RegistryDirectory) cacheInvokerByEvent(event *registry.ServiceEvent) (protocol.Invoker, error) {
   	// judge is override or others
   	if event != nil {
   		u := dir.convertUrl(event)
   		switch event.Action {
   		case remoting.EventTypeAdd, remoting.EventTypeUpdate:
   			logger.Infof("selector add service url{%s}", event.Service)
   			if u != nil && constant.ROUTER_PROTOCOL == u.Protocol {
   				dir.configRouters()
   			}
   			return dir.cacheInvoker(u), nil
   		case remoting.EventTypeDel:
   			logger.Infof("selector delete service url{%s}", event.Service)
   			// for a delete-type event
   			return dir.uncacheInvoker(u), nil
   		default:
   			return nil, fmt.Errorf("illegal event type: %v", event.Action)
   		}
   	}
   	return nil, nil
   }
   ```
   > registry/directory/directory.go:327
   
   ``` go
   // uncacheInvoker will return abandoned Invoker, if no Invoker to be abandoned, return nil
   func (dir *RegistryDirectory) uncacheInvoker(url *common.URL) protocol.Invoker {
   	return dir.uncacheInvokerWithKey(url.Key())
   }
   ```
   > common/url.go:341
   ``` go
   // Key gets key
   func (c *URL) Key() string {
   	buildString := fmt.Sprintf("%s://%s:%s@%s:%s/?interface=%s&group=%s&version=%s",
   		c.Protocol, c.Username, c.Password, c.Ip, c.Port, c.Service(), c.GetParam(constant.GROUP_KEY, ""), c.GetParam(constant.VERSION_KEY, ""))
   	return buildString
   }
   ```
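   The key code is the `Key()` method: because the `ip`, `port`, and the other fields are all the same, the old and the new provider end up with the same key.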
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@dubbo.apache.org
For additional commands, e-mail: notifications-help@dubbo.apache.org

[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Imp: delete a service provider when using k8s hpa

Posted by GitBox <gi...@apache.org>.

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


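   > Do the other methods that build a key have the same problem?
   
   After talking with the team at 开课啦, the whole setup runs in a k8s environment with zk as the registry. The failure unfolds as follows:
   
   1 service A (sA below) has a service node, provider M (pM below), on physical host X (hX below);
   2 pM registers with the registry using hX's IP:Port rather than the IP:Port of its own pod, so that consumers outside the k8s cluster can also call the sA service provided by pM;
   3 a new sA node, provider N (pN below), is started on hX, and pN also registers with the registry using hX's IP:Port;
   4 after pN has run stably for a while, pM is taken offline;
   5 when the consumer receives the pM offline event, pM and pN share the same service key, so both pM and pN are removed from its local cache.
   
   Looking at the sequence, the root cause is the user's devops deployment, but they would like the dubbo/dubbogo layer to absorb the problem. The user wants the offline notification to be checked against some reliable field (timestamp, for example?) to confirm that the right service is being taken offline.
   
   This rests on one precondition: the registry delivers notification events in order.
   
   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, although events may be lost; the dubbo/dubbogo health check can compensate for that;
   5 李志鹏 from the nacos team confirmed that nacos guarantees ordered event notification; in his words: "As long as the client's service online and offline operations on the same node are ordered, the notifications are ordered too."
   
   Next, based on an analysis of the code, the improvement is as follows:
   1 on receiving an offline event, first check whether the provider for that service key has been invoked recently (within one heartbeat period); if it has, do not take it offline, and let the healthCheck result decide whether it is eventually removed;
   2 if it has not been invoked recently, compare the timestamp field carried in the registry's offline notification; if it matches, take the provider offline.
   ![image](https://user-images.githubusercontent.com/7959374/114296196-85034a00-9adc-11eb-9aff-0ecef00fa7ca.png)
   
   This double safeguard keeps the probability of a mistaken removal to a minimum.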



[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Imp: delete a service provider when using k8s hpa

Posted by GitBox <gi...@apache.org>.

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


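   > Do the other methods that build a key have the same problem?
   
   ## Background
   
   After talking with the team at 开课啦, the whole setup runs in a k8s environment with zk as the registry. The failure unfolds as follows:
   
   1 service A (sA below) has a service node, provider M (pM below), on physical host X (hX below);
   2 pM registers with the registry using hX's IP:Port rather than the IP:Port of its own pod, so that consumers outside the k8s cluster can also call the sA service provided by pM;
   3 a new sA node, provider N (pN below), is started on hX, and pN also registers with the registry using hX's IP:Port;
   4 the consumer receives pN's online notification event from the registry; because pM and pN have the same service key, pM in the service map of the local cache is replaced by pN;
   5 after pN has run stably for a while, pM is taken offline;
   6 when the consumer receives the pM offline event, it deletes pN from its local cache.
   
   ## Analysis
   
   Looking at the sequence, the root cause is the user's devops deployment, but they would like the dubbo/dubbogo layer to absorb the problem. The user wants the offline notification to be checked against some reliable field (timestamp, for example?) to confirm that the right service is being taken offline.
   
   dubbo does not have this problem: every time a dubbo consumer receives an event notification for a service, it pulls all providers of that service from the registry and syncs them locally (that is, it pulls a snapshot), which keeps the consumer's local data consistent with the registry.
   
   dubbogo handles this differently: apart from pulling the full data set at startup, a dubbogo consumer mostly applies incremental changes based on events and then updates its local provider address cache.
   
   With a large number of services, dubbogo's approach has a big performance advantage. Suppose a service has 1000 providers and 1000 consumers. During a rolling upgrade of the service, the load dubbo puts on the server side is: consumers pull from the registry 1000 * 1000 = 1 000 000 times, each pull carries 1000 metadata entries, for a total of 1 000 000 000 metadata entries. The load dubbogo puts on the registry is: the registry pushes 1000 * 1000 = 1 000 000 events to consumers, each push covering one provider going offline and coming online, for a total of 2 000 000 metadata entries.
   
   This rests on one precondition: the registry delivers notification events in order.
   
   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, although events may be lost; the dubbo/dubbogo health check can compensate for that;
   5 李志鹏 from the nacos team confirmed that nacos guarantees ordered event notification; in his words: "As long as the client's service online and offline operations on the same node are ordered, the notifications are ordered too."
   
   ## Solution
   
   Based on an analysis of the code, the smallest fix is to add a time dimension to the key of the consumer-side service cache. The code is as follows: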
   
   ```go
   // common/url.go
   // CacheInvokerMapKey get dir cacheInvokerMap key
   func (c *URL) CacheInvokerMapKey() string {
   	urlNew, _ := NewURL(c.PrimitiveURL)
   
   	buildString := fmt.Sprintf("%s://%s:%s@%s:%s/?interface=%s&group=%s&version=%s&timestamp=%s",
   		c.Protocol, c.Username, c.Password, c.Ip, c.Port, c.Service(), c.GetParam(constant.GROUP_KEY, ""),
   		c.GetParam(constant.VERSION_KEY, ""), urlNew.GetParam(constant.TIMESTAMP_KEY, ""))
   	return buildString
   }
   ```
   
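   The ultimate improvement is:
   
   1 on receiving an offline event, first check whether the provider for that service key has been invoked recently (within one heartbeat period); if it has, do not take it offline, and let the healthCheck result decide whether it is eventually removed;
   2 if it has not been invoked recently, compare the timestamp field carried in the registry's offline notification; if it matches, take the provider offline.
   ![image](https://user-images.githubusercontent.com/7959374/114296196-85034a00-9adc-11eb-9aff-0ecef00fa7ca.png)
   
   This double safeguard keeps the probability of a mistaken removal to a minimum.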



[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Using the node IP in k8s can easily take a service offline so that clients cannot find it

Posted by GitBox <gi...@apache.org>.

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


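   > Do the other methods that build a key have the same problem?
   
   If one has the problem, they certainly all do; the handling is similar. First, use some reliable field in the offline notification (timestamp, for example?) to confirm that the right service is being taken offline.
   
   This rests on one precondition: the registry delivers notification events in order.
   
   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, although events may be lost; the dubbo/dubbogo health check can compensate for that;
   5 for nacos I am not sure; I will go and ask.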



[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Imp: delete a service provider when using k8s hpa

Posted by GitBox <gi...@apache.org>.

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


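   > Do the other methods that build a key have the same problem?
   
   After talking with the team at 开课啦, the whole setup runs in a k8s environment with zk as the registry. The failure unfolds as follows:
   
   1 service A (sA below) has a service node, provider M (pM below), on physical host X (hX below);
   2 pM registers with the registry using hX's IP:Port rather than the IP:Port of its own pod, so that consumers outside the k8s cluster can also call the sA service provided by pM;
   3 a new sA node, provider N (pN below), is started on hX, and pN also registers with the registry using hX's IP:Port;
   4 after pN has run stably for a while, pM is taken offline;
   5 when the consumer receives the pM offline event, pM and pN share the same service key, so both pM and pN are removed from its local cache.
   
   Looking at the sequence, the root cause is the user's devops deployment, but they would like the dubbo/dubbogo layer to absorb the problem. The user wants the offline notification to be checked against some reliable field (timestamp, for example?) to confirm that the right service is being taken offline.
   
   This rests on one precondition: the registry delivers notification events in order.
   
   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, although events may be lost; the dubbo/dubbogo health check can compensate for that;
   5 for nacos I am not sure; I will go and ask.
   
   Next, based on an analysis of the code, the improvement is as follows:
   1 on receiving an offline event, first check whether the provider for that service key has been invoked recently (within one heartbeat period); if it has, do not take it offline, and let the healthCheck result decide whether it is eventually removed;
   2 if it has not been invoked recently, compare the timestamp field carried in the registry's offline notification; if it matches, take the provider offline.
   ![image](https://user-images.githubusercontent.com/7959374/114296196-85034a00-9adc-11eb-9aff-0ecef00fa7ca.png)
   
   This double safeguard keeps the probability of a mistaken removal to a minimum.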



[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Using the node IP in k8s can easily take a service offline so that clients cannot find it

Posted by GitBox <gi...@apache.org>.

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


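   > Do the other methods that build a key have the same problem?
   
   After talking with the team at 开课啦, the whole setup runs in a k8s environment with zk as the registry. The failure unfolds as follows:
   
   1 service A (sA below) has a service node, provider M (pM below), on physical host X (hX below);
   2 pM registers with the registry using hX's IP:Port rather than the IP:Port of its own pod, so that consumers outside the k8s cluster can also call the sA service provided by pM;
   3 a new sA node, provider N (pN below), is started on hX, and pN also registers with the registry using hX's IP:Port;
   4 after pN has run stably for a while, pM is taken offline;
   5 when the consumer receives the pM offline event, pM and pN share the same service key, so both pM and pN are removed from its local cache.
   
   If one has the problem, they certainly all do; the handling is similar. First, use some reliable field in the offline notification (timestamp, for example?) to confirm that the right service is being taken offline.
   
   This rests on one precondition: the registry delivers notification events in order.
   
   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, although events may be lost; the dubbo/dubbogo health check can compensate for that;
   5 for nacos I am not sure; I will go and ask.
   
   Next, based on an analysis of the code, the improvement is as follows:
   1 on receiving an offline event, first check whether the provider for that service key has been invoked recently (within one heartbeat period); if it has, do not take it offline, and let the healthCheck result decide whether it is eventually removed;
   2 if it has not been invoked recently, compare the timestamp field carried in the registry's offline notification; if it matches, take the provider offline.
   
   This double safeguard keeps the probability of a mistaken removal to a minimum.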



[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Imp: delete a service provider when using k8s hpa

Posted by GitBox <gi...@apache.org>.

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


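   > Do the other methods that build a key have the same problem?
   
   ## Background
   
   After talking with the team at 开课啦, the whole setup runs in a k8s environment with zk as the registry. The failure unfolds as follows:
   
   1 service A (sA below) has a service node, provider M (pM below), on physical host X (hX below);
   2 pM registers with the registry using hX's IP:Port rather than the IP:Port of its own pod, so that consumers outside the k8s cluster can also call the sA service provided by pM;
   3 a new sA node, provider N (pN below), is started on hX, and pN also registers with the registry using hX's IP:Port;
   4 the consumer receives pN's online notification event from the registry; because pM and pN have the same service key, pM in the service map of the local cache is replaced by pN;
   5 after pN has run stably for a while, pM is taken offline;
   6 when the consumer receives the pM offline event, it deletes pN from its local cache.
   
   ## Analysis
   
   Looking at the sequence, the root cause is the user's devops deployment, but they would like the dubbo/dubbogo layer to absorb the problem. The user wants the offline notification to be checked against some reliable field (timestamp, for example?) to confirm that the right service is being taken offline.
   
   dubbo does not have this problem: every time a dubbo consumer receives an event notification for a service, it pulls all providers of that service from the registry and syncs them locally (that is, it pulls a snapshot), which keeps the consumer's local data consistent with the registry.
   
   dubbogo handles this differently: apart from pulling the full data set at startup, a dubbogo consumer mostly applies incremental changes based on events and then updates its local provider address cache.
   
   With a large number of services, dubbogo's approach has a big performance advantage. Suppose a service has 1000 providers and 1000 consumers. During a rolling upgrade of the service, the load dubbo puts on the server side is: consumers pull from the registry 2 * 1000 * 1000 = 2 000 000 times (the factor 2 is because a service node first brings the new instance online and then takes the old one offline, which produces two notification events), each pull carries 1000 metadata entries, for a total of 2 000 000 000 metadata entries. The load dubbogo puts on the registry is: the registry pushes 1000 * 1000 = 1 000 000 events to consumers, each push covering one provider going offline and coming online, for a total of 2 000 000 metadata entries.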
   
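   This rests on one precondition: the registry delivers notification events in order.
   
   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, although events may be lost; the dubbo/dubbogo health check can compensate for that;
   5 李志鹏 from the nacos team confirmed that nacos guarantees ordered event notification; in his words: "As long as the client's service online and offline operations on the same node are ordered, the notifications are ordered too."
   
   ## Solution
   
   Based on an analysis of the code, the smallest fix is to add a time dimension to the key of the consumer-side service cache. The code is as follows: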
   
   ```go
   // common/url.go
   // CacheInvokerMapKey get dir cacheInvokerMap key
   func (c *URL) CacheInvokerMapKey() string {
   	urlNew, _ := NewURL(c.PrimitiveURL)
   
   	buildString := fmt.Sprintf("%s://%s:%s@%s:%s/?interface=%s&group=%s&version=%s&timestamp=%s",
   		c.Protocol, c.Username, c.Password, c.Ip, c.Port, c.Service(), c.GetParam(constant.GROUP_KEY, ""),
   		c.GetParam(constant.VERSION_KEY, ""), urlNew.GetParam(constant.TIMESTAMP_KEY, ""))
   	return buildString
   }
   ```
   
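   The ultimate improvement is:
   
   1 on receiving an offline event, first check whether the provider for that service key has been invoked recently (within one heartbeat period); if it has, do not take it offline, and let the healthCheck result decide whether it is eventually removed;
   2 if it has not been invoked recently, compare the timestamp field carried in the registry's offline notification; if it matches, take the provider offline.
   ![image](https://user-images.githubusercontent.com/7959374/114296196-85034a00-9adc-11eb-9aff-0ecef00fa7ca.png)
   
   This double safeguard keeps the probability of a mistaken removal to a minimum.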



[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Imp: delete a service provider when using k8s hpa

Posted by GitBox <gi...@apache.org>.

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


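   > Do the other methods that build a key have the same problem?
   
   After talking with the team at 开课啦, the whole setup runs in a k8s environment with zk as the registry. The failure unfolds as follows:
   
   1 service A (sA below) has a service node, provider M (pM below), on physical host X (hX below);
   2 pM registers with the registry using hX's IP:Port rather than the IP:Port of its own pod, so that consumers outside the k8s cluster can also call the sA service provided by pM;
   3 a new sA node, provider N (pN below), is started on hX, and pN also registers with the registry using hX's IP:Port;
   4 the consumer receives pN's online notification event from the registry; because pM and pN have the same service key, pM in the service map of the local cache is replaced by pN;
   5 after pN has run stably for a while, pM is taken offline;
   6 when the consumer receives the pM offline event, it deletes pN from its local cache.
   
   Looking at the sequence, the root cause is the user's devops deployment, but they would like the dubbo/dubbogo layer to absorb the problem. The user wants the offline notification to be checked against some reliable field (timestamp, for example?) to confirm that the right service is being taken offline.
   
   This rests on one precondition: the registry delivers notification events in order.
   
   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, although events may be lost; the dubbo/dubbogo health check can compensate for that;
   5 李志鹏 from the nacos team confirmed that nacos guarantees ordered event notification; in his words: "As long as the client's service online and offline operations on the same node are ordered, the notifications are ordered too."
   
   Next, based on an analysis of the code, the improvement is as follows:
   1 on receiving an offline event, first check whether the provider for that service key has been invoked recently (within one heartbeat period); if it has, do not take it offline, and let the healthCheck result decide whether it is eventually removed;
   2 if it has not been invoked recently, compare the timestamp field carried in the registry's offline notification; if it matches, take the provider offline.
   ![image](https://user-images.githubusercontent.com/7959374/114296196-85034a00-9adc-11eb-9aff-0ecef00fa7ca.png)
   
   This double safeguard keeps the probability of a mistaken removal to a minimum.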



[GitHub] [dubbo-go] AlexStocks commented on issue #1141: Using the node IP in k8s can easily take a service offline so that clients cannot find it

Posted by GitBox <gi...@apache.org>.

AlexStocks commented on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


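   > Do the other methods that build a key have the same problem?
   
   If one has the problem, they certainly all do; the handling is similar. First, use some reliable field in the offline notification (timestamp, for example?) to confirm that the right service is being taken offline.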



[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Using the node IP in k8s can easily take a service offline so that clients cannot find it

Posted by GitBox <gi...@apache.org>.

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


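   > Do the other methods that build a key have the same problem?
   
   After talking with the team at 开课啦, the whole setup runs in a k8s environment with zk as the registry. The failure unfolds as follows:
   
   1 service A (sA below) has a service node, provider M (pM below), on physical host X (hX below);
   2 pM registers with the registry using hX's IP:Port rather than the IP:Port of its own pod, so that consumers outside the k8s cluster can also call the sA service provided by pM;
   3 a new sA node, provider N (pN below), is started on hX, and pN also registers with the registry using hX's IP:Port;
   4 after pN has run stably for a while, pM is taken offline;
   5 when the consumer receives the pM offline event, pM and pN share the same service key, so both pM and pN are removed from its local cache.
   
   Looking at the sequence, the root cause is the user's devops deployment, but they would like the dubbo/dubbogo layer to absorb the problem. The user wants the offline notification to be checked against some reliable field (timestamp, for example?) to confirm that the right service is being taken offline.
   
   This rests on one precondition: the registry delivers notification events in order.
   
   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, although events may be lost; the dubbo/dubbogo health check can compensate for that;
   5 for nacos I am not sure; I will go and ask.
   
   Next, based on an analysis of the code, the improvement is as follows:
   1 on receiving an offline event, first check whether the provider for that service key has been invoked recently (within one heartbeat period); if it has, do not take it offline, and let the healthCheck result decide whether it is eventually removed;
   2 if it has not been invoked recently, compare the timestamp field carried in the registry's offline notification; if it matches, take the provider offline.
   
   This double safeguard keeps the probability of a mistaken removal to a minimum.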



[GitHub] [dubbo-go] zhaoyunxing92 commented on issue #1141: Using the node IP in k8s can easily take a service offline so that clients cannot find it

Posted by GitBox <gi...@apache.org>.

zhaoyunxing92 commented on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817168737


   Do the other methods that build a key have the same problem?



[GitHub] [dubbo-go] AlexStocks edited a comment on issue #1141: Imp: delete a service provider when using k8s hpa

Posted by GitBox <gi...@apache.org>.

AlexStocks edited a comment on issue #1141:
URL: https://github.com/apache/dubbo-go/issues/1141#issuecomment-817224471


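   > Do the other methods that build a key have the same problem?
   
   After talking with the team at 开课啦, the whole setup runs in a k8s environment with zk as the registry. The failure unfolds as follows:
   
   1 service A (sA below) has a service node, provider M (pM below), on physical host X (hX below);
   2 pM registers with the registry using hX's IP:Port rather than the IP:Port of its own pod, so that consumers outside the k8s cluster can also call the sA service provided by pM;
   3 a new sA node, provider N (pN below), is started on hX, and pN also registers with the registry using hX's IP:Port;
   4 after pN has run stably for a while, pM is taken offline;
   5 when the consumer receives the pM offline event, pM and pN share the same service key, so both pM and pN are removed from its local cache.
   
   Looking at the sequence, the root cause is the user's devops deployment, but they would like the dubbo/dubbogo layer to absorb the problem. The user wants the offline notification to be checked against some reliable field (timestamp, for example?) to confirm that the right service is being taken offline.
   
   This rests on one precondition: the registry delivers notification events in order.
   
   Here is how the registries we support handle this property:
   1 etcd has the concept of a revision, a global version number for the data, so ordering can be guaranteed;
   2 k8s, which is built on etcd, can guarantee it as well;
   3 consul, which is similar to etcd, can guarantee it;
   4 zk can also guarantee ordering, although events may be lost; the dubbo/dubbogo health check can compensate for that;
   5 李志鹏 from the nacos team confirmed that nacos guarantees ordered event notification; in his words: "As long as the client's service online and offline operations on the same node are ordered, the notifications are ordered too."
   
   Next, based on an analysis of the code, the improvement is as follows:
   1 on receiving an offline event, first check whether the provider for that service key has been invoked recently (within one heartbeat period); if it has, do not take it offline, and let the healthCheck result decide whether it is eventually removed;
   2 if it has not been invoked recently, compare the timestamp field carried in the registry's offline notification; if it matches, take the provider offline.
   ![image](https://user-images.githubusercontent.com/7959374/114296196-85034a00-9adc-11eb-9aff-0ecef00fa7ca.png)
   
   This double safeguard keeps the probability of a mistaken removal to a minimum.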

