You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@apisix.apache.org by GitBox <gi...@apache.org> on 2020/11/30 12:00:18 UTC

[GitHub] [apisix] gy09535 opened a new issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

gy09535 opened a new issue #2899:
URL: https://github.com/apache/apisix/issues/2899


   ### Issue description
   
   Today I find some auth error from etcd, I try to connect etcd to find the error, I find so many auth request cause etcd api timeout, this is my etcd logs:
   ```
   2020-11-30 19:55:01.150181 W | auth: invalid auth token: SDZBJcAaVedJqUFY.116415
   2020-11-30 19:55:01.226868 W | auth: invalid auth token: MNtikXEJuXkIOWVt.114382
   2020-11-30 19:55:01.238848 W | auth: invalid auth token: ArWxwPXGQRclPdZp.114301
   2020-11-30 19:55:01.254654 W | auth: invalid auth token: WuqYtmtIDCGftGwr.114315
   2020-11-30 19:55:01.283579 W | auth: invalid auth token: CqdVqeYChuiEWgWh.114324
   2020-11-30 19:55:01.320021 W | auth: invalid auth token: YHlQCFrmBrGsHXXV.116397
   2020-11-30 19:55:01.320712 W | auth: invalid auth token: YyOUXXpgpqQyNryf.114384
   2020-11-30 19:55:01.359229 W | auth: invalid auth token: YXSKKPuaGObcCrFZ.114346
   2020-11-30 19:55:01.367728 W | auth: invalid auth token: HFqBwCvzPfTRVxih.116421
   2020-11-30 19:55:01.389121 W | auth: invalid auth token: KGVyAgnXWOgFDFYl.114375
   2020-11-30 19:55:01.408058 W | auth: invalid auth token: EGSajOVclqMhLIqS.114317
   2020-11-30 19:55:01.424942 W | auth: invalid auth token: qtXATtyhLZJSGIfB.116433
   2020-11-30 19:55:01.480119 W | auth: invalid auth token: tqHxDVbjSKBVCDaS.114331
   2020-11-30 19:55:01.485725 W | auth: invalid auth token: siQhqOvdtRUXsHiW.116425
   2020-11-30 19:55:01.521545 W | auth: invalid auth token: xVPFwHcEMXScHhGZ.114338
   2020-11-30 19:55:01.529845 W | auth: invalid auth token: vqtBsVIGKtNCjUpb.114336
   2020-11-30 19:55:01.541498 W | auth: invalid auth token: XFqNUGjGYYsHhpWI.114345
   2020-11-30 19:55:01.580206 W | auth: invalid auth token: CiBhQUGpwiUPczNo.114314
   2020-11-30 19:55:01.580919 W | auth: invalid auth token: JVaWKqKTDxnhwZig.114351
   2020-11-30 19:55:01.600957 W | auth: invalid auth token: OIdYBazTuoTeRyhV.114320
   2020-11-30 19:55:01.628087 W | auth: invalid auth token: bSuTMTUaydeOSFPw.114311
   2020-11-30 19:55:01.650066 W | auth: invalid auth token: CUImlKqMxLllTDAG.116422
   2020-11-30 19:55:01.671630 W | auth: invalid auth token: MckpSvnXCslOBZzN.114350
   2020-11-30 19:55:01.730228 W | auth: invalid auth token: ZoCzVsZVnxKkXgkh.114323
   2020-11-30 19:55:01.809137 W | auth: invalid auth token: MNtikXEJuXkIOWVt.114382
   2020-11-30 19:55:01.840882 W | auth: invalid auth token: eLrJyzLIfDCCtLgS.114352
   2020-11-30 19:55:01.892522 W | auth: invalid auth token: mprbhPNbjByDExCW.116398
   2020-11-30 19:55:01.973658 W | auth: invalid auth token: MMKVaKJYLCPjtYra.114335
   2020-11-30 19:55:02.055462 W | auth: invalid auth token: eLrJyzLIfDCCtLgS.114352
   2020-11-30 19:55:02.313522 W | auth: invalid auth token: NcKCHbtvAvQTJWap.116409
   2020-11-30 19:55:02.348031 W | auth: invalid auth token: iQkxDrTcVsWzzBwz.114330
   2020-11-30 19:55:02.588947 W | auth: invalid auth token: THlTKQvjLFueVrsC.116408
   2020-11-30 19:55:02.620312 W | auth: invalid auth token: zEoNFjjLmZLFLujv.114328
   2020-11-30 19:55:02.637622 W | auth: invalid auth token: jmyLhQNCVaHCuESg.114312
   2020-11-30 19:55:02.760728 W | auth: invalid auth token: bloWjYzYVchKqYCb.114376
   ```
   
   I get package from  etcd serve and find this  packages:
   ![image](https://user-images.githubusercontent.com/15153469/100607748-57481080-3346-11eb-9067-731603410649.png)
   
   After I check the apisix code, I find this code can cause dead loop auth.
   ```
   function refresh_jwt_token(self)
       -- token exist and not expire
       -- default is 5min, we use 3min
       -- https://github.com/etcd-io/etcd/issues/8287
       if self.jwt_token and now() - self.last_auth_time < 60 * 3 then
           return true, nil
       end
   
       local opts = {
           body = {
               name         = self.user,
               password     = self.password,
           }
       }
       local res, err = _request_uri(self, 'POST',
                           choose_endpoint(self).full_prefix .. "/auth/authenticate",
                           opts, 5, true)    -- default authenticate timeout 5 second
       if err then
           return nil, err
       end
   
       if not res or not res.body or not res.body.token then
           return nil, 'authenticate refresh token fail'
       end
   
       self.jwt_token = res.body.token
       self.last_auth_time = now()
   
       return true, nil
   end
   ```
   ref: https://github.com/api7/lua-resty-etcd/blob/master/lib/resty/etcd/v3.lua#L221
   ### Environment
   
   * apisix version (cmd: `apisix version`):
   * OS:
   
   ### Minimal test code / Steps to reproduce the issue
   1.
   2.
   3.
   
   ### What's the actual result? (including assertion message & call stack if applicable)
   
   
   
   ### What's the expected result?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737650987


   Fore every auth retch,  they will do 32 times fetch from this code
   ```
    while not exiting() and self.running and i <= 32 do
           i = i + 1
   ```
   
   I think this ok will always true, if we do auth fail, and it can not sleep 3s. 
   
   ```
    if not ok then
               log.error("failed to fetch data from etcd: ", err, ", ",
                         tostring(self))
               ngx_sleep(3)
               break
           end
   ```
   It will sleep 0.5 s  from this code 
   ```
               if err then
                   if err ~= "timeout" and err ~= "Key not found"
                       and self.last_err ~= err then
                       log.error("failed to fetch data from etcd: ", err, ", ",
                                 tostring(self))
                   end
   
                   if err ~= self.last_err then
                       self.last_err = err
                       self.last_err_time = ngx_time()
                   else
                       if ngx_time() - self.last_err_time >= 30 then
                           self.last_err = nil
                       end
                   end
                   ngx_sleep(0.5)
   ```
   they are 16 batch job in on worker, this confused me ,it should be 8. 
   ```
   2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
   ```
   for one worker, they are 16*2=32 requests (max) in on seconds , why we try to fetch 32 times in on auth fetch, I think we should remove it.
   ```
    while not exiting() and self.running and i <= 32 do
           i = i + 1
   ```
   
   And I think we should sleep for more times when fetch is error.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736940923


   > Some interesting results:
   > 
   > 1. every auth request is terminated after 5 seconds. It is because 5 is hardcoded as timeout value:
   >    https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
   > 2. requests are in batch. We can add some random number to smooth the auth requests.
   > 
   > Some strange results:
   > 
   > 1. the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
   > 2. Even under high pressure, there should be a few requests which can be achieved. However, all requests were failed. I doubt if the etcd has been hanged on for a while.
   
   Do you find the caller, who is frequently call the request? Any way we should control the auth fail hander , we can not allow request to auth again and again , we must protected the etcd not recevie too many request  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736943542


   > Do you find the caller, who is frequently call the request?
   
   As I said, I can't reproduce it.
   
   > the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
   
   You need to provide the backtrack.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736187041


   > > @Yiyiyimu @nic-chen welcome to take a look at this issue
   > 
   > I am not sure different instances use one account can cause etcd auth problem?
   
   I read the etcd doc: I read the etcd doc ref: https://etcd.io/docs/v3.4.0/learning/design-auth-v3/ 
   ```
   There are two kinds of token types: simple and JWT. The simple token isn’t designed for production use cases. Its tokens aren’t cryptographically signed and servers must statefully track token-user correspondence; it is meant for development testing. JWT tokens should be used for production deployments since it is cryptographically signed and verified. From the implementation perspective, JWT is stateless. Its token can include metadata including username and revision, so servers don’t need to remember correspondence between tokens and the metadata.
   ```
   Jwt is stateless and It can auth for every request, This is not a problem.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-739621994


   > they are 16 batch job in on worker, this confused me ,it should be 8.
   
   The logs come from two different process, one is worker and the other is privilege agent. You can confirm it via the different pids.
   
   I have submitted https://github.com/apache/apisix/pull/2977, which is based on your PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735796493


   > what is the error log at APISIX?
   > 
   > we can sleep more time if failed to refresh JWT auth.
   Apisix render this error, I think it because of etcd is busy to handle request.
   ```
   020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/proto, context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/global_rules, context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/services, context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/ssl, context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/plugin_metadata, context: ngx.timer
   ```
   
   This is very weird, the auth is right, what tigger jwt refresh crazy?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736200736


   > After I check the apisix code, I find this code can cause dead loop auth, when sometimes jwt refresh is fail.
   
   I don't think there is a dead loop. If dead loop exists, the request internal won't be tens of ms. It will be much shorter. Can you provide a standalone example to prove your suspect?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737158868


   I try to use bad user and I find too many auth request can cause etcd cpu 100% . Apisx  render too many auth error in etcd_config.lua ,ref : https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L509
   I think the _automatic_fetch may be has problem, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L486
   
   ### etcd cpu 100%
   ![image](https://user-images.githubusercontent.com/15153469/100864332-d1a49c00-34d0-11eb-8e44-794f4369216c.png)
   
   ### etcd error log 
   ![image](https://user-images.githubusercontent.com/15153469/100864479-01ec3a80-34d1-11eb-96c9-fbba4c58225b.png)
   
   ### apisix error log 
   ![image](https://user-images.githubusercontent.com/15153469/100864536-13354700-34d1-11eb-8dc2-b0c152d6ed99.png)
   
   @spacewander  You can use my case to reproduce it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737650987


   Fore every auth retch,  they will do 32 times fetch from this code
   ```
    while not exiting() and self.running and i <= 32 do
           i = i + 1
   ```
   
   I think this "ok" var will always true, if we do auth fail, and it can not sleep 3s. 
   
   ```
    if not ok then
               log.error("failed to fetch data from etcd: ", err, ", ",
                         tostring(self))
               ngx_sleep(3)
               break
           end
   ```
   It will sleep 0.5 s  from this code when auth is fail.
   ```
               if err then
                   if err ~= "timeout" and err ~= "Key not found"
                       and self.last_err ~= err then
                       log.error("failed to fetch data from etcd: ", err, ", ",
                                 tostring(self))
                   end
   
                   if err ~= self.last_err then
                       self.last_err = err
                       self.last_err_time = ngx_time()
                   else
                       if ngx_time() - self.last_err_time >= 30 then
                           self.last_err = nil
                       end
                   end
                   ngx_sleep(0.5)
   ```
   they are 16 batch job in on worker, this confused me ,it should be 8. 
   ```
   2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
   ```
   for one worker, they are 16*2=32 requests (max) in one second  when auth is fail, I think it is an problem.I  think we should be remove this code , we can receive  config is not sync immediately.
   ```
    while not exiting() and self.running and i <= 32 do
           i = i + 1
   ```
   
   And I think we should sleep for more times when fetch is error.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735796493


   > what is the error log at APISIX?
   > 
   > we can sleep more time if failed to refresh JWT auth.
   
   
   Apisix render this error, I think it because of etcd is busy to handle request.
   ```
   020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/proto, context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/global_rules, context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/services, context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/ssl, context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/plugin_metadata, context: ngx.timer
   ```
   
   This is very weird, the auth is right, what tigger jwt refresh crazy?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736953972


   > > Do you find the caller, who is frequently call the request?
   > 
   > As I said, I can't reproduce it.
   > 
   > > the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
   > 
   > You need to provide the backtrack.
   
   yeah, I am try to find it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736283075


   @gy09535 
   How many workers per APISIX instance?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737241988


   > I was mislead by the error log screenshots. The two error log are in different time interval. The etcd one is in a second, and the Nginx error log one is in 40 seconds. So actually you have 30 requests in 40 seconds for 1 workers, which is not strange.
   
   hahaha, ignore it, this code is ok, I will try to find why etcd receive 40 auth fail in a second. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737229032


   > 30~ requests are lower than expected peak number (~40) of requests can be created by 4 workers. But the captured packets show that when the issue occurs, the peak number of requests can go ~100 per second. This is what let me feel strange.
   
   @spacewander 
   I change the worker to 1 to do this test, 1 worker has 30 requests and  4 workers should has  120 ~ request.I think it is an problem, if we deploy more instances the request can be magnified.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736940923


   > Some interesting results:
   > 
   > 1. every auth request is terminated after 5 seconds. It is because 5 is hardcoded as timeout value:
   >    https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
   > 2. requests are in batch. We can add some random number to smooth the auth requests.
   > 
   > Some strange results:
   > 
   > 1. the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
   > 2. Even under high pressure, there should be a few requests which can be achieved. However, all requests were failed. I doubt if the etcd has been hanged on for a while.
   
   Do you find the caller, who is frequently call the request? Any way we should control the auth fail hander , we can not allow request to auth again and again , we must protected the etcd not recevie too many request.
   
   Some special condition can cause it ,not every time ,When it happened I restart the gateway  the auth is normal, and the etcd request become normal.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737650987


   Fore every auth retch,  they will do 32 times fetch from this code
   ```
    while not exiting() and self.running and i <= 32 do
           i = i + 1
   ```
   
   I think this "ok" var will always true, if we do auth fail, and it can not sleep 3s. 
   
   ```
    if not ok then
               log.error("failed to fetch data from etcd: ", err, ", ",
                         tostring(self))
               ngx_sleep(3)
               break
           end
   ```
   It will sleep 0.5 s  from this code 
   ```
               if err then
                   if err ~= "timeout" and err ~= "Key not found"
                       and self.last_err ~= err then
                       log.error("failed to fetch data from etcd: ", err, ", ",
                                 tostring(self))
                   end
   
                   if err ~= self.last_err then
                       self.last_err = err
                       self.last_err_time = ngx_time()
                   else
                       if ngx_time() - self.last_err_time >= 30 then
                           self.last_err = nil
                       end
                   end
                   ngx_sleep(0.5)
   ```
   they are 16 batch job in on worker, this confused me ,it should be 8. 
   ```
   2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
   ```
   for one worker, they are 16*2=32 requests (max) in on seconds , why we try to fetch 32 times in on auth fetch, I think we should remove it.
   ```
    while not exiting() and self.running and i <= 32 do
           i = i + 1
   ```
   
   And I think we should sleep for more times when fetch is error.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737169117


   etcd cpu 100% and 30~ requests per second don't look strange to me. They are expected.
   
   I have said it yesterday, "the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired."
   
   30~ requests are lower than expected peak number (~40) of requests can be created by 4 workers. But the captured packets show that when the issue occurs, the peak number of requests can go ~100 per second. This is what let me feel strange.
   
   We already waste too much time on the red herrings.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737239317


   I was mislead by the error log screenshots. The two error log is in different time interval. The etcd one is in a second, and the Nginx error log one is in 40 seconds. So actually you have 30 requests in 40 seconds for 1 workers, which is not strange.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737650987


   Fore every auth retch,  they will do 32 times fetch from this code
   ```
    while not exiting() and self.running and i <= 32 do
           i = i + 1
   ```
   
   I think this "ok" var will always true, if we do auth fail, and it can not sleep 3s. 
   
   ```
    if not ok then
               log.error("failed to fetch data from etcd: ", err, ", ",
                         tostring(self))
               ngx_sleep(3)
               break
           end
   ```
   It will sleep 0.5 s  from this code 
   ```
               if err then
                   if err ~= "timeout" and err ~= "Key not found"
                       and self.last_err ~= err then
                       log.error("failed to fetch data from etcd: ", err, ", ",
                                 tostring(self))
                   end
   
                   if err ~= self.last_err then
                       self.last_err = err
                       self.last_err_time = ngx_time()
                   else
                       if ngx_time() - self.last_err_time >= 30 then
                           self.last_err = nil
                       end
                   end
                   ngx_sleep(0.5)
   ```
   they are 16 batch job in on worker, this confused me ,it should be 8. 
   ```
   2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
   2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
   ```
   for one worker, they are 16*2=32 requests (max) in one second , why we try to fetch 32 times in on auth fetch, I think we should remove it.
   ```
    while not exiting() and self.running and i <= 32 do
           i = i + 1
   ```
   
   And I think we should sleep for more times when fetch is error.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] membphis commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
membphis commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735764003


   @Yiyiyimu @nic-chen welcome to take a look at this issue


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737229032


   > 30~ requests are lower than expected peak number (~40) of requests can be created by 4 workers. But the captured packets show that when the issue occurs, the peak number of requests can go ~100 per second. This is what let me feel strange.
   
   @spacewander 
   I change the worker to 1 to do this test, 1 worker has 30 requests and  4 workers should has  120 ~ request.I think it is an problem, if we deploy more instances the request can be magnified.
   ![image](https://user-images.githubusercontent.com/15153469/100878438-26eaa880-34e5-11eb-8c8e-b33e185a05cb.png)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736213624


   > @gy09535
   > Can you provide a packet capture file so we can analyze it?
   
   [tcp_etcd_01.cap.zip](https://github.com/apache/apisix/files/5620286/tcp_etcd_01.cap.zip)
   
   ```
   Etcd version : 3.4.13
   Etcd roles:  root , gateway
   gateway role permission: /apisix/*   /routes
   Etcd user : apisix(role  gateway)
   ```
   Use  one user with  3  apisix instances and sometimes can find this issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] membphis commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
membphis commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735763734


   what is the error log at APISIX? 
   
   we can sleep more time if failed to refresh JWT auth.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736510902


   Some interesting results:
   1. every auth request is terminated after 5 seconds. It is because 5 is hardcoded as timeout value: 
   https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
   2. requests are in batch. We can add some random number to smooth the auth requests.
   
   Some strange results:
   1. the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it at my side. Need to log the backtrack when the request is fired.
   2. Even under high pressure, there should be a few requests which can be achieved. However, all requests were failed. I doubt if the etcd has been hanged on for a while.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737248967


   Anyway, I just submit an ongoing PR: https://github.com/apache/apisix/pull/2932. The PR can reduce the auth request significantly. Haven't finished for some details and tests yet.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737713053


   After I change the sync code ,the etcd auth fail become normal, and the etcd cpu become normal. I think fix this issue should in to direction. 
   one as @spacewander  said we can share etcd client to do request .
   And we should also control sync data frequency, we can sync every 10 s for peer dir. And we should add the random time delay for auth_refresh.
   
   After I change the sync code  and with error auth ,the etcd render :
   ![image](https://user-images.githubusercontent.com/15153469/100977007-50561380-357b-11eb-8ec9-17d32d9dde83.png)
   ![image](https://user-images.githubusercontent.com/15153469/100977027-5815b800-357b-11eb-9062-bf870e2aa4a4.png)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736213624


   > @gy09535
   > Can you provide a packet capture file so we can analyze it?
   
   [tcp_etcd_01.cap.zip](https://github.com/apache/apisix/files/5620286/tcp_etcd_01.cap.zip)
   
   ```
   Etcd version : 3.4.13
   Etcd roles:  root , gateway
   gateway role permission: /apisix/*   /routes
   Etcd user : apisix(role  gateway)
   ```
   Use  one user with  3  apisix instances and sometimes can find this issue,I am not sure what tigger it with no  obviously error log in log file.I think tigger this  is some api called too frequent such as watcher. So change the auth logical can avoid it. I am try to prove it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737713053


   After I change the sync code ,the etcd auth fail become normal, and the etcd cpu become normal. I think fix this issue should in to direction. 
   one as @spacewander  said we can share etcd client to do request .
   And we should also control sync data frequency, we can sync every 10 s for peer dir. And we should add the random time delay for auth_refresh.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736305847


   > @gy09535
   > How many workers per APISIX instance?
   
   4


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737164323


   @gy09535 
   Please provide the backtrack.
   And please don't guess! If you think somewhere has problem, give a prove!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736201398


   @gy09535 
   Can you provide a packet capture file so we can analyze it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737158868


   I try to use bad user and I find too many auth request can cause etcd cpu 100% . Apisx  render too many auth error in etcd_config.lua ,ref : https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L509
   I think the _automatic_fetch may be has problem, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L486
   
   ### etcd cpu 100%
   ![image](https://user-images.githubusercontent.com/15153469/100864332-d1a49c00-34d0-11eb-8e44-794f4369216c.png)
   
   ### etcd error log 
   ![image](https://user-images.githubusercontent.com/15153469/100864479-01ec3a80-34d1-11eb-96c9-fbba4c58225b.png)
   
   ### apisix error log 
   ![image](https://user-images.githubusercontent.com/15153469/100864536-13354700-34d1-11eb-8dc2-b0c152d6ed99.png)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735796493


   > what is the error log at APISIX?
   > 
   > we can sleep more time if failed to refresh JWT auth.
   Apisix render this error, I think it because of etcd is busy to handle request.
   ```
   020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/proto, context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/global_rules, context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/services, context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/ssl, context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/plugin_metadata, context: ngx.timer
   ```
   
   This is very weird, the auth is right, what tigger jwt refresh crazy.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-739750207


   > > they are 16 batch job in on worker, this confused me ,it should be 8.
   > 
   > The logs come from two different process, one is worker and the other is privilege agent. You can confirm it via the different pids.
   > 
   > I have submitted #2977, which is based on your PR.
   
   Got it, thanks  for your professional responses.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737713053






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737239317


   I was mislead by the error log screenshots. The two error log are in different time interval. The etcd one is in a second, and the Nginx error log one is in 40 seconds. So actually you have 30 requests in 40 seconds for 1 workers, which is not strange.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-735796493


   > what is the error log at APISIX?
   > 
   > we can sleep more time if failed to refresh JWT auth.
   Apisix render this error, I think it because of etcd is busy to handle request.
   ```
   020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/proto, context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/global_rules, context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/services, context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/ssl, context: ngx.timer
   2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
   2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/plugin_metadata, context: ngx.timer
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736187041


   > > @Yiyiyimu @nic-chen welcome to take a look at this issue
   > 
   > I am not sure different instances use one account can cause etcd auth problem?
   
    I read the etcd doc ref: https://etcd.io/docs/v3.4.0/learning/design-auth-v3/ 
   ```
   There are two kinds of token types: simple and JWT. The simple token isn’t designed for production use cases. Its tokens aren’t cryptographically signed and servers must statefully track token-user correspondence; it is meant for development testing. JWT tokens should be used for production deployments since it is cryptographically signed and verified. From the implementation perspective, JWT is stateless. Its token can include metadata including username and revision, so servers don’t need to remember correspondence between tokens and the metadata.
   ```
   Jwt is stateless and It can auth for every request, This is not a problem.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736170112


   > @Yiyiyimu @nic-chen welcome to take a look at this issue
   
   I am not sure different instances use one account can cause etcd auth problem?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 edited a comment on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 edited a comment on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-736213624


   > @gy09535
   > Can you provide a packet capture file so we can analyze it?
   
   [tcp_etcd_01.cap.zip](https://github.com/apache/apisix/files/5620286/tcp_etcd_01.cap.zip)
   
   ```
   Etcd version : 3.4.13
   Etcd roles:  root , gateway
   gateway role permission: /apisix/*   /routes
   Etcd user : apisix(role  gateway)
   ```
   Use  one user with  3  apisix instances and sometimes can find this issue,I am not sure what tigger it with no  obviously error log in log file.I think tigger this  is some api called too frequent. So change the auth logical can avoid it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737158868


   I try to use bad user and I find too many auth request can cause etcd cpu 100% . Apisx  render too many auth error in etcd_config.lua ,ref : https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L509
   I think the _automatic_fetch may be has problem, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L486
   
   #### etcd cpu 100%
   ![image](https://user-images.githubusercontent.com/15153469/100864332-d1a49c00-34d0-11eb-8e44-794f4369216c.png)
   
   ### etcd error log 
   ![image](https://user-images.githubusercontent.com/15153469/100864479-01ec3a80-34d1-11eb-96c9-fbba4c58225b.png)
   
   ### apisix error log 
   ![image](https://user-images.githubusercontent.com/15153469/100864536-13354700-34d1-11eb-8dc2-b0c152d6ed99.png)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] spacewander commented on issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
spacewander commented on issue #2899:
URL: https://github.com/apache/apisix/issues/2899#issuecomment-737180947


   Anyway, we can share the etcd client, to reduce the auth requests. Although there may still be `2.5x` problem and `etcd possible hang` problem, this optimization can fix the etcd 100% and sync failure. (Who mind 2.5x problem if we only have 2 requests per worker?)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [apisix] gy09535 closed issue #2899: bug: Etcd jwt refresh error can cause gateway cluster crash.

Posted by GitBox <gi...@apache.org>.
gy09535 closed issue #2899:
URL: https://github.com/apache/apisix/issues/2899


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org