You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@apisix.apache.org by GitBox <gi...@apache.org> on 2020/11/30 12:00:18 UTC
[GitHub] [apisix] gy09535 opened a new issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
gy09535 opened a new issue #2899:
URL: https://github.com/apache/apisix/issues/2899
### Issue description
Today I find some auth error from etcd, I try to connect etcd to find the error, I find so many auth request cause etcd api timeout, this is my etcd logs:
```
2020-11-30 19:55:01.150181 W | auth: invalid auth token: SDZBJcAaVedJqUFY.116415
2020-11-30 19:55:01.226868 W | auth: invalid auth token: MNtikXEJuXkIOWVt.114382
2020-11-30 19:55:01.238848 W | auth: invalid auth token: ArWxwPXGQRclPdZp.114301
2020-11-30 19:55:01.254654 W | auth: invalid auth token: WuqYtmtIDCGftGwr.114315
2020-11-30 19:55:01.283579 W | auth: invalid auth token: CqdVqeYChuiEWgWh.114324
2020-11-30 19:55:01.320021 W | auth: invalid auth token: YHlQCFrmBrGsHXXV.116397
2020-11-30 19:55:01.320712 W | auth: invalid auth token: YyOUXXpgpqQyNryf.114384
2020-11-30 19:55:01.359229 W | auth: invalid auth token: YXSKKPuaGObcCrFZ.114346
2020-11-30 19:55:01.367728 W | auth: invalid auth token: HFqBwCvzPfTRVxih.116421
2020-11-30 19:55:01.389121 W | auth: invalid auth token: KGVyAgnXWOgFDFYl.114375
2020-11-30 19:55:01.408058 W | auth: invalid auth token: EGSajOVclqMhLIqS.114317
2020-11-30 19:55:01.424942 W | auth: invalid auth token: qtXATtyhLZJSGIfB.116433
2020-11-30 19:55:01.480119 W | auth: invalid auth token: tqHxDVbjSKBVCDaS.114331
2020-11-30 19:55:01.485725 W | auth: invalid auth token: siQhqOvdtRUXsHiW.116425
2020-11-30 19:55:01.521545 W | auth: invalid auth token: xVPFwHcEMXScHhGZ.114338
2020-11-30 19:55:01.529845 W | auth: invalid auth token: vqtBsVIGKtNCjUpb.114336
2020-11-30 19:55:01.541498 W | auth: invalid auth token: XFqNUGjGYYsHhpWI.114345
2020-11-30 19:55:01.580206 W | auth: invalid auth token: CiBhQUGpwiUPczNo.114314
2020-11-30 19:55:01.580919 W | auth: invalid auth token: JVaWKqKTDxnhwZig.114351
2020-11-30 19:55:01.600957 W | auth: invalid auth token: OIdYBazTuoTeRyhV.114320
2020-11-30 19:55:01.628087 W | auth: invalid auth token: bSuTMTUaydeOSFPw.114311
2020-11-30 19:55:01.650066 W | auth: invalid auth token: CUImlKqMxLllTDAG.116422
2020-11-30 19:55:01.671630 W | auth: invalid auth token: MckpSvnXCslOBZzN.114350
2020-11-30 19:55:01.730228 W | auth: invalid auth token: ZoCzVsZVnxKkXgkh.114323
2020-11-30 19:55:01.809137 W | auth: invalid auth token: MNtikXEJuXkIOWVt.114382
2020-11-30 19:55:01.840882 W | auth: invalid auth token: eLrJyzLIfDCCtLgS.114352
2020-11-30 19:55:01.892522 W | auth: invalid auth token: mprbhPNbjByDExCW.116398
2020-11-30 19:55:01.973658 W | auth: invalid auth token: MMKVaKJYLCPjtYra.114335
2020-11-30 19:55:02.055462 W | auth: invalid auth token: eLrJyzLIfDCCtLgS.114352
2020-11-30 19:55:02.313522 W | auth: invalid auth token: NcKCHbtvAvQTJWap.116409
2020-11-30 19:55:02.348031 W | auth: invalid auth token: iQkxDrTcVsWzzBwz.114330
2020-11-30 19:55:02.588947 W | auth: invalid auth token: THlTKQvjLFueVrsC.116408
2020-11-30 19:55:02.620312 W | auth: invalid auth token: zEoNFjjLmZLFLujv.114328
2020-11-30 19:55:02.637622 W | auth: invalid auth token: jmyLhQNCVaHCuESg.114312
2020-11-30 19:55:02.760728 W | auth: invalid auth token: bloWjYzYVchKqYCb.114376
```
I get package from etcd serve and find this packages:
![image](https://user-images.githubusercontent.com/15153469/100607748-57481080-3346-11eb-9067-731603410649.png)
After I check the apisix code, I find this code can cause dead loop auth.
```
function refresh_jwt_token(self)
-- token exist and not expire
-- default is 5min, we use 3min
-- https://github.com/etcd-io/etcd/issues/8287
if self.jwt_token and now() - self.last_auth_time < 60 * 3 then
return true, nil
end
local opts = {
body = {
name = self.user,
password = self.password,
}
}
local res, err = _request_uri(self, 'POST',
choose_endpoint(self).full_prefix .. "/auth/authenticate",
opts, 5, true) -- default authenticate timeout 5 second
if err then
return nil, err
end
if not res or not res.body or not res.body.token then
return nil, 'authenticate refresh token fail'
end
self.jwt_token = res.body.token
self.last_auth_time = now()
return true, nil
end
```
ref: https://github.com/api7/lua-resty-etcd/blob/master/lib/resty/etcd/v3.lua#L221
### Environment
* apisix version (cmd: `apisix version`):
* OS:
### Minimal test code / Steps to reproduce the issue
1.
2.
3.
### What's the actual result? (including assertion message & call stack if applicable)
### What's the expected result?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737650987
Fore every auth retch, they will do 32 times fetch from this code
```
while not exiting() and self.running and i <= 32 do
i = i + 1
```
I think this ok will always true, if we do auth fail, and it can not sleep 3s.
```
if not ok then
log.error("failed to fetch data from etcd: ", err, ", ",
tostring(self))
ngx_sleep(3)
break
end
```
It will sleep 0.5 s from this code
```
if err then
if err ~= "timeout" and err ~= "Key not found"
and self.last_err ~= err then
log.error("failed to fetch data from etcd: ", err, ", ",
tostring(self))
end
if err ~= self.last_err then
self.last_err = err
self.last_err_time = ngx_time()
else
if ngx_time() - self.last_err_time >= 30 then
self.last_err = nil
end
end
ngx_sleep(0.5)
```
they are 16 batch job in on worker, this confused me ,it should be 8.
```
2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
```
for one worker, they are 16*2=32 requests (max) in on seconds , why we try to fetch 32 times in on auth fetch, I think we should remove it.
```
while not exiting() and self.running and i <= 32 do
i = i + 1
```
And I think we should sleep for more times when fetch is error.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736940923
> Some interesting results:
>
> 1. every auth request is terminated after 5 seconds. It is because 5 is hardcoded as timeout value:
> https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
> 2. requests are in batch. We can add some random number to smooth the auth requests.
>
> Some strange results:
>
> 1. the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
> 2. Even under high pressure, there should be a few requests which can be achieved. However, all requests were failed. I doubt if the etcd has been hanged on for a while.
Do you find the caller, who is frequently call the request? Any way we should control the auth fail hander , we can not allow request to auth again and again , we must protected the etcd not recevie too many request
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736943542
> Do you find the caller, who is frequently call the request?
As I said, I can't reproduce it.
> the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
You need to provide the backtrack.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736187041
> > @Yiyiyimu @nic-chen welcome to take a look at this issue
>
> I am not sure different instances use one account can cause etcd auth problem?
I read the etcd doc: I read the etcd doc ref: https://etcd.io/docs/v3.4.0/learning/design-auth-v3/
```
There are two kinds of token types: simple and JWT. The simple token isn’t designed for production use cases. Its tokens aren’t cryptographically signed and servers must statefully track token-user correspondence; it is meant for development testing. JWT tokens should be used for production deployments since it is cryptographically signed and verified. From the implementation perspective, JWT is stateless. Its token can include metadata including username and revision, so servers don’t need to remember correspondence between tokens and the metadata.
```
Jwt is stateless and It can auth for every request, This is not a problem.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-739621994
> they are 16 batch job in on worker, this confused me ,it should be 8.
The logs come from two different process, one is worker and the other is privilege agent. You can confirm it via the different pids.
I have submitted https://github.com/apache/apisix/pull/2977, which is based on your PR.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735796493
> what is the error log at APISIX?
>
> we can sleep more time if failed to refresh JWT auth.
Apisix render this error, I think it because of etcd is busy to handle request.
```
020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/proto, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/global_rules, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/services, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/ssl, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/plugin_metadata, context: ngx.timer
```
This is very weird, the auth is right, what tigger jwt refresh crazy?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736200736
> After I check the apisix code, I find this code can cause dead loop auth, when sometimes jwt refresh is fail.
I don't think there is a dead loop. If dead loop exists, the request internal won't be tens of ms. It will be much shorter. Can you provide a standalone example to prove your suspect?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737158868
I try to use bad user and I find too many auth request can cause etcd cpu 100% . Apisx render too many auth error in etcd_config.lua ,ref : https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L509
I think the _automatic_fetch may be has problem, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L486
### etcd cpu 100%
![image](https://user-images.githubusercontent.com/15153469/100864332-d1a49c00-34d0-11eb-8e44-794f4369216c.png)
### etcd error log
![image](https://user-images.githubusercontent.com/15153469/100864479-01ec3a80-34d1-11eb-96c9-fbba4c58225b.png)
### apisix error log
![image](https://user-images.githubusercontent.com/15153469/100864536-13354700-34d1-11eb-8dc2-b0c152d6ed99.png)
@spacewander You can use my case to reproduce it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737650987
Fore every auth retch, they will do 32 times fetch from this code
```
while not exiting() and self.running and i <= 32 do
i = i + 1
```
I think this "ok" var will always true, if we do auth fail, and it can not sleep 3s.
```
if not ok then
log.error("failed to fetch data from etcd: ", err, ", ",
tostring(self))
ngx_sleep(3)
break
end
```
It will sleep 0.5 s from this code when auth is fail.
```
if err then
if err ~= "timeout" and err ~= "Key not found"
and self.last_err ~= err then
log.error("failed to fetch data from etcd: ", err, ", ",
tostring(self))
end
if err ~= self.last_err then
self.last_err = err
self.last_err_time = ngx_time()
else
if ngx_time() - self.last_err_time >= 30 then
self.last_err = nil
end
end
ngx_sleep(0.5)
```
they are 16 batch job in on worker, this confused me ,it should be 8.
```
2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
```
for one worker, they are 16*2=32 requests (max) in one second when auth is fail, I think it is an problem.I think we should be remove this code , we can receive config is not sync immediately.
```
while not exiting() and self.running and i <= 32 do
i = i + 1
```
And I think we should sleep for more times when fetch is error.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735796493
> what is the error log at APISIX?
>
> we can sleep more time if failed to refresh JWT auth.
Apisix render this error, I think it because of etcd is busy to handle request.
```
020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/proto, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/global_rules, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/services, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/ssl, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/plugin_metadata, context: ngx.timer
```
This is very weird, the auth is right, what tigger jwt refresh crazy?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736953972
> > Do you find the caller, who is frequently call the request?
>
> As I said, I can't reproduce it.
>
> > the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
>
> You need to provide the backtrack.
yeah, I am try to find it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736283075
@gy09535
How many workers per APISIX instance?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737241988
> I was mislead by the error log screenshots. The two error log are in different time interval. The etcd one is in a second, and the Nginx error log one is in 40 seconds. So actually you have 30 requests in 40 seconds for 1 workers, which is not strange.
hahaha, ignore it, this code is ok, I will try to find why etcd receive 40 auth fail in a second.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737229032
> 30~ requests are lower than expected peak number (~40) of requests can be created by 4 workers. But the captured packets show that when the issue occurs, the peak number of requests can go ~100 per second. This is what let me feel strange.
@spacewander
I change the worker to 1 to do this test, 1 worker has 30 requests and 4 workers should has 120 ~ request.I think it is an problem, if we deploy more instances the request can be magnified.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736940923
> Some interesting results:
>
> 1. every auth request is terminated after 5 seconds. It is because 5 is hardcoded as timeout value:
> https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
> 2. requests are in batch. We can add some random number to smooth the auth requests.
>
> Some strange results:
>
> 1. the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
> 2. Even under high pressure, there should be a few requests which can be achieved. However, all requests were failed. I doubt if the etcd has been hanged on for a while.
Do you find the caller, who is frequently call the request? Any way we should control the auth fail hander , we can not allow request to auth again and again , we must protected the etcd not recevie too many request.
Some special condition can cause it ,not every time ,When it happened I restart the gateway the auth is normal, and the etcd request become normal.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737650987
Fore every auth retch, they will do 32 times fetch from this code
```
while not exiting() and self.running and i <= 32 do
i = i + 1
```
I think this "ok" var will always true, if we do auth fail, and it can not sleep 3s.
```
if not ok then
log.error("failed to fetch data from etcd: ", err, ", ",
tostring(self))
ngx_sleep(3)
break
end
```
It will sleep 0.5 s from this code
```
if err then
if err ~= "timeout" and err ~= "Key not found"
and self.last_err ~= err then
log.error("failed to fetch data from etcd: ", err, ", ",
tostring(self))
end
if err ~= self.last_err then
self.last_err = err
self.last_err_time = ngx_time()
else
if ngx_time() - self.last_err_time >= 30 then
self.last_err = nil
end
end
ngx_sleep(0.5)
```
they are 16 batch job in on worker, this confused me ,it should be 8.
```
2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
```
for one worker, they are 16*2=32 requests (max) in on seconds , why we try to fetch 32 times in on auth fetch, I think we should remove it.
```
while not exiting() and self.running and i <= 32 do
i = i + 1
```
And I think we should sleep for more times when fetch is error.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737169117
etcd cpu 100% and 30~ requests per second don't look strange to me. They are expected.
I have said it yesterday, "the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired."
30~ requests are lower than expected peak number (~40) of requests can be created by 4 workers. But the captured packets show that when the issue occurs, the peak number of requests can go ~100 per second. This is what let me feel strange.
We already waste too much time on the red herrings.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737239317
I was mislead by the error log screenshots. The two error log is in different time interval. The etcd one is in a second, and the Nginx error log one is in 40 seconds. So actually you have 30 requests in 40 seconds for 1 workers, which is not strange.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737650987
Fore every auth retch, they will do 32 times fetch from this code
```
while not exiting() and self.running and i <= 32 do
i = i + 1
```
I think this "ok" var will always true, if we do auth fail, and it can not sleep 3s.
```
if not ok then
log.error("failed to fetch data from etcd: ", err, ", ",
tostring(self))
ngx_sleep(3)
break
end
```
It will sleep 0.5 s from this code
```
if err then
if err ~= "timeout" and err ~= "Key not found"
and self.last_err ~= err then
log.error("failed to fetch data from etcd: ", err, ", ",
tostring(self))
end
if err ~= self.last_err then
self.last_err = err
self.last_err_time = ngx_time()
else
if ngx_time() - self.last_err_time >= 30 then
self.last_err = nil
end
end
ngx_sleep(0.5)
```
they are 16 batch job in on worker, this confused me ,it should be 8.
```
2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
```
for one worker, they are 16*2=32 requests (max) in one second , why we try to fetch 32 times in on auth fetch, I think we should remove it.
```
while not exiting() and self.running and i <= 32 do
i = i + 1
```
And I think we should sleep for more times when fetch is error.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] membphis commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
membphis commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735764003
@Yiyiyimu @nic-chen welcome to take a look at this issue
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737229032
> 30~ requests are lower than expected peak number (~40) of requests can be created by 4 workers. But the captured packets show that when the issue occurs, the peak number of requests can go ~100 per second. This is what let me feel strange.
@spacewander
I change the worker to 1 to do this test, 1 worker has 30 requests and 4 workers should has 120 ~ request.I think it is an problem, if we deploy more instances the request can be magnified.
![image](https://user-images.githubusercontent.com/15153469/100878438-26eaa880-34e5-11eb-8c8e-b33e185a05cb.png)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736213624
> @gy09535
> Can you provide a packet capture file so we can analyze it?
[tcp_etcd_01.cap.zip](https://github.com/apache/apisix/files/5620286/tcp_etcd_01.cap.zip)
```
Etcd version : 3.4.13
Etcd roles: root , gateway
gateway role permission: /apisix/* /routes
Etcd user : apisix(role gateway)
```
Use one user with 3 apisix instances and sometimes can find this issue.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] membphis commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
membphis commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735763734
what is the error log at APISIX?
we can sleep more time if failed to refresh JWT auth.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736510902
Some interesting results:
1. every auth request is terminated after 5 seconds. It is because 5 is hardcoded as timeout value:
https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
2. requests are in batch. We can add some random number to smooth the auth requests.
Some strange results:
1. the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
2. Even under high pressure, there should be a few requests which can be achieved. However, all requests were failed. I doubt if the etcd has been hanged on for a while.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737248967
Anyway, I just submit an ongoing PR: https://github.com/apache/apisix/pull/2932. The PR can reduce the auth request significantly. Haven't finished for some details and tests yet.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737713053
After I change the sync code ,the etcd auth fail become normal, and the etcd cpu become normal. I think fix this issue should in to direction.
one as @spacewander said we can share etcd client to do request .
And we should also control sync data frequency, we can sync every 10 s for peer dir. And we should add the random time delay for auth_refresh.
After I change the sync code and with error auth ,the etcd render :
![image](https://user-images.githubusercontent.com/15153469/100977007-50561380-357b-11eb-8ec9-17d32d9dde83.png)
![image](https://user-images.githubusercontent.com/15153469/100977027-5815b800-357b-11eb-9062-bf870e2aa4a4.png)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736213624
> @gy09535
> Can you provide a packet capture file so we can analyze it?
[tcp_etcd_01.cap.zip](https://github.com/apache/apisix/files/5620286/tcp_etcd_01.cap.zip)
```
Etcd version : 3.4.13
Etcd roles: root , gateway
gateway role permission: /apisix/* /routes
Etcd user : apisix(role gateway)
```
Use one user with 3 apisix instances and sometimes can find this issue,I am not sure what tigger it with no obviously error log in log file.I think tigger this is some api called too frequent such as watcher. So change the auth logical can avoid it. I am try to prove it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737713053
After I change the sync code ,the etcd auth fail become normal, and the etcd cpu become normal. I think fix this issue should in to direction.
one as @spacewander said we can share etcd client to do request .
And we should also control sync data frequency, we can sync every 10 s for peer dir. And we should add the random time delay for auth_refresh.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736305847
> @gy09535
> How many workers per APISIX instance?
4
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737164323
@gy09535
Please provide the backtrack.
And please don't guess! If you think somewhere has problem, give a prove!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736201398
@gy09535
Can you provide a packet capture file so we can analyze it?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737158868
I try to use bad user and I find too many auth request can cause etcd cpu 100% . Apisx render too many auth error in etcd_config.lua ,ref : https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L509
I think the _automatic_fetch may be has problem, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L486
### etcd cpu 100%
![image](https://user-images.githubusercontent.com/15153469/100864332-d1a49c00-34d0-11eb-8e44-794f4369216c.png)
### etcd error log
![image](https://user-images.githubusercontent.com/15153469/100864479-01ec3a80-34d1-11eb-96c9-fbba4c58225b.png)
### apisix error log
![image](https://user-images.githubusercontent.com/15153469/100864536-13354700-34d1-11eb-8dc2-b0c152d6ed99.png)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735796493
> what is the error log at APISIX?
>
> we can sleep more time if failed to refresh JWT auth.
Apisix render this error, I think it because of etcd is busy to handle request.
```
020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/proto, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/global_rules, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/services, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/ssl, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/plugin_metadata, context: ngx.timer
```
This is very weird, the auth is right, what tigger jwt refresh crazy.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-739750207
> > they are 16 batch job in on worker, this confused me ,it should be 8.
>
> The logs come from two different process, one is worker and the other is privilege agent. You can confirm it via the different pids.
>
> I have submitted #2977, which is based on your PR.
Got it, thanks for your professional responses.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737713053
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737239317
I was mislead by the error log screenshots. The two error log are in different time interval. The etcd one is in a second, and the Nginx error log one is in 40 seconds. So actually you have 30 requests in 40 seconds for 1 workers, which is not strange.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735796493
> what is the error log at APISIX?
>
> we can sleep more time if failed to refresh JWT auth.
Apisix render this error, I think it because of etcd is busy to handle request.
```
020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/proto, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/global_rules, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/services, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/ssl, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer, etcd key: /apisix/plugin_metadata, context: ngx.timer
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736187041
> > @Yiyiyimu @nic-chen welcome to take a look at this issue
>
> I am not sure different instances use one account can cause etcd auth problem?
I read the etcd doc ref: https://etcd.io/docs/v3.4.0/learning/design-auth-v3/
```
There are two kinds of token types: simple and JWT. The simple token isn’t designed for production use cases. Its tokens aren’t cryptographically signed and servers must statefully track token-user correspondence; it is meant for development testing. JWT tokens should be used for production deployments since it is cryptographically signed and verified. From the implementation perspective, JWT is stateless. Its token can include metadata including username and revision, so servers don’t need to remember correspondence between tokens and the metadata.
```
Jwt is stateless and It can auth for every request, This is not a problem.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736170112
> @Yiyiyimu @nic-chen welcome to take a look at this issue
I am not sure different instances use one account can cause etcd auth problem?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736213624
> @gy09535
> Can you provide a packet capture file so we can analyze it?
[tcp_etcd_01.cap.zip](https://github.com/apache/apisix/files/5620286/tcp_etcd_01.cap.zip)
```
Etcd version : 3.4.13
Etcd roles: root , gateway
gateway role permission: /apisix/* /routes
Etcd user : apisix(role gateway)
```
Use one user with 3 apisix instances and sometimes can find this issue,I am not sure what tigger it with no obviously error log in log file.I think tigger this is some api called too frequent. So change the auth logical can avoid it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737158868
I try to use bad user and I find too many auth request can cause etcd cpu 100% . Apisx render too many auth error in etcd_config.lua ,ref : https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L509
I think the _automatic_fetch may be has problem, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L486
#### etcd cpu 100%
![image](https://user-images.githubusercontent.com/15153469/100864332-d1a49c00-34d0-11eb-8e44-794f4369216c.png)
### etcd error log
![image](https://user-images.githubusercontent.com/15153469/100864479-01ec3a80-34d1-11eb-96c9-fbba4c58225b.png)
### apisix error log
![image](https://user-images.githubusercontent.com/15153469/100864536-13354700-34d1-11eb-8dc2-b0c152d6ed99.png)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737180947
Anyway, we can share the etcd client, to reduce the auth requests. Although there may still be `2.5x` problem and `etcd possible hang` problem, this optimization can fix the etcd 100% and sync failure. (Who mind 2.5x problem if we only have 2 requests per worker?)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [apisix] gy09535 closed issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.
Posted by GitBox <gi...@apache.org>.
gy09535 closed issue #2899:
URL: https://github.com/apache/apisix/issues/2899
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org