You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/12/16 12:48:28 UTC

[GitHub] [pulsar] suiyuzeng opened a new issue #13363: Producer registe failed when reconnecting

suiyuzeng opened a new issue #13363:
URL: https://github.com/apache/pulsar/issues/13363


   **Describe the bug**
      Producer registe failed and retry still failed when reconnecting after keep-alive timeout. Logs as follows:
   ```
   2021-12-14 15:11:07,284 [ INFO ] ProducerImpl - [persistent://public/default/level2-pressure-1-partition-48] [level2-54-188] Created producer on cnx [id: 0x92ee3683, L:/aa.aa.aa.aa:34398
   - R:server.docker.ys/bb.bb.bb.bb:6650]
   2021-12-14 15:54:45,473 [ WARN ] PulsarHandler - [[id: 0x92ee3683, L:/aa.aa.aa.aa:34398 - R:server.docker.ys/bb.bb.bb.bb:6650]] Forcing connection to close after keep-alive timeout
   2021-12-14 15:54:45,507 [ INFO ] ClientCnx - [id: 0x92ee3683, L:/aa.aa.aa.aa:34398 ! R:server.docker.ys/bb.bb.bb.bb:6650] Disconnected
   2021-12-14 15:54:45,507 [ INFO ] ConnectionHandler - [persistent://public/default/level2-pressure-1-partition-48] [level2-54-188] Closed connection [id: 0x92ee3683, L:/aa.aa.aa.aa:34398 !
    R:server.docker.ys/bb.bb.bb.bb:6650] -- Will try again in 0.1 s
   2021-12-14 15:54:45,608 [ INFO ] ConnectionHandler - [persistent://public/default/level2-pressure-1-partition-48] [level2-54-188] Reconnecting after timeout
   2021-12-14 15:54:45,611 [ INFO ] ProducerImpl - [persistent://public/default/level2-pressure-1-partition-48] [level2-54-188] Creating producer on cnx [id: 0x3050a9d3, L:/aa.aa.aa.aa:34172 - R:server.docker.ys/bb.bb.bb.bb:6650]
   2021-12-14 15:54:45,630 [ WARN ] ClientCnx - [id: 0x3050a9d3, L:/aa.aa.aa.aa:34172 - R:server.docker.ys/bb.bb.bb.bb:6650] Received error from server: org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'level2-54-188' is already connected to topic
   2021-12-14 15:54:45,630 [ ERROR ] ProducerImpl - [persistent://public/default/level2-pressure-1-partition-48] [level2-54-188] Failed to create producer: org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'level2-54-188' is already connected to topic
   ```
   In client, it seems there is some problem about the network. Close the connection and retry. But the server send the error "Producer is already connected to topic" for earch retry.
   
   In the server log,  do not find the log about close event.
   2021-12-14 15:11:07 The server receive the registe info from the port 34398 for the first time. 
   2021-12-14 15:54:45 get the new connection from the same client with the port of 34172, and registe the same producer.
   ```
   2021-12-14 15:11:07,283 [ INFO ] ServerCnx - [/aa.aa.aa.aa:34398][persistent://public/default/level2-pressure-1-partition-48] Creating producer. producerId=48
   2021-12-14 15:11:07,283 [ INFO ] ServerCnx - [/aa.aa.aa.aa:34398] persistent://public/default/level2-pressure-1-partition-48 configured with schema false
   2021-12-14 15:11:07,283 [ INFO ] ServerCnx - [/aa.aa.aa.aa:34398] Created new producer: Producer{topic=PersistentTopic{topic=persistent://public/default/level2-pressure-1-partition-48}, client=/aa.aa.aa.aa:34398, producerName=level2-54-188, producerId=48}
   2021-12-14 15:54:45,610 [ INFO ] ServerCnx - New connection from /aa.aa.aa.aa:34172
   2021-12-14 15:54:45,629 [ INFO ] ServerCnx - [/aa.aa.aa.aa:34172][persistent://public/default/level2-pressure-1-partition-48] Creating producer. producerId=48
   2021-12-14 15:54:45,629 [ INFO ] ServerCnx - [/aa.aa.aa.aa:34172] persistent://public/default/level2-pressure-1-partition-48 configured with schema false
   2021-12-14 15:54:45,629 [ ERROR ] ServerCnx - [/aa.aa.aa.aa:34172] Failed to add producer to topic persistent://public/default/level2-pressure-1-partition-48: producerId=48, org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'level2-54-188' is already connected to topic
   ```
   The server did not get the channel Inactive event about the port  34398 as there is no log about "Closed connection from".
   
   ```
   @Override
       public void channelInactive(ChannelHandlerContext ctx) throws Exception {
           super.channelInactive(ctx);
           connectionController.decreaseConnection(ctx.channel().remoteAddress());
           isActive = false;
           log.info("Closed connection from {}", remoteAddress); // there is no log about this.
           BrokerInterceptor brokerInterceptor = getBrokerService().getInterceptor();
           if (brokerInterceptor != null) {
               brokerInterceptor.onConnectionClosed(this);
           }
   ```
   And there is still a connection with the port 34398 in the server, but do not find the connection with the port in the client. It seem the FIN package of tcp do not reach the server.
   
   Another problem is that the keep-alive is not work in the server. If it work, the connection of the port of 34398 will close, so the producer can regist again.
   In org.apache.pulsar.common.protocol.PulsarHandler#handleKeepAliveTimeout, there is no log about the connection, even open debug level log. 
   
   There is a possibility that a exception is threw in org.apache.pulsar.common.protocol.PulsarHandler#handleKeepAliveTimeout. Then the schedule task will not run any more as there is no exeception catch. "ctx.writeAndFlush" may throw some RuntimeException as there is some problem about the network.
   
   The problem appear 2 times. And it does not appear any more after adding some logs.
   
   How about adding exception catch in In org.apache.pulsar.common.protocol.PulsarHandler#handleKeepAliveTimeout?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] github-actions[bot] commented on issue #13363: Producer registe failed when reconnecting

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #13363:
URL: https://github.com/apache/pulsar/issues/13363#issuecomment-1051438658


   The issue had no activity for 30 days, mark with Stale label.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] Jason918 commented on issue #13363: Producer registe failed when reconnecting

Posted by GitBox <gi...@apache.org>.
Jason918 commented on issue #13363:
URL: https://github.com/apache/pulsar/issues/13363#issuecomment-996408912


   @michaeljmarshall  PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] suiyuzeng closed issue #13363: Producer registe failed when reconnecting

Posted by GitBox <gi...@apache.org>.
suiyuzeng closed issue #13363:
URL: https://github.com/apache/pulsar/issues/13363


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on issue #13363: Producer registe failed when reconnecting

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on issue #13363:
URL: https://github.com/apache/pulsar/issues/13363#issuecomment-996468881


   @suiyuzeng - thanks for opening this issue. I have been working in this part of the code base recently, so I am happy to take a look. I should be able to sometime in the next few days. I am going to assign this to myself for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on issue #13363: Producer registe failed when reconnecting

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on issue #13363:
URL: https://github.com/apache/pulsar/issues/13363#issuecomment-1082599037


   @suiyuzeng - you mention that you're running 2.8.2, but that version wasn't released when you opened this issue. Can you double check which version of broker and client where you're seeing this issue? Also, which client are you using (java, go, python, etc.)?
   
   > In client, it seems there is some problem about the network. Close the connection and retry. But the server send the error "Producer is already connected to topic" for earch retry.
   
   @lhotari observed this behavior and fixed a race condition just this week: https://github.com/apache/pulsar/pull/13428. That patch won't be live until 2.8.3.
   
   > 2021-12-14 15:11:07 The server receive the registe info from the port 34398 for the first time.
   2021-12-14 15:54:45 get the new connection from the same client with the port of 34172, and registe the same producer.
   
   Is it possible that you're creating two producers with the same name? 
   
   > There is a possibility that a exception is threw in org.apache.pulsar.common.protocol.PulsarHandler#handleKeepAliveTimeout. Then the schedule task will not run any more as there is no exeception catch. "ctx.writeAndFlush" may throw some RuntimeException as there is some problem about the network.
   
   This is solved by https://github.com/apache/pulsar/pull/12853, which will be included in 2.8.2.
   
   Now that 2.8.3 has been released, are you able to test using the latest version to see if the issue is still present?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] suiyuzeng commented on issue #13363: Producer registe failed when reconnecting

Posted by GitBox <gi...@apache.org>.
suiyuzeng commented on issue #13363:
URL: https://github.com/apache/pulsar/issues/13363#issuecomment-1082703611


   @michaeljmarshall We have changed to the release version. It is not present any more. Thanks very much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org