Posted to dev@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/09/08 21:57:55 UTC

[GitHub] [pulsar-dotpulsar] usaguerrilla opened a new issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

usaguerrilla opened a new issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56


   If I add the following exception cases to the default handler, then the consumer and producer can reconnect. IMHO the handler is a dangerous thing: if the implementation or the underlying .NET changes and new exceptions are thrown, we might end up with server applications that hang from time to time for no apparent reason.
   
   ```
   PersistenceException _ => FaultAction.Retry,
   IOException _ => FaultAction.Retry,
   ```
   
   These exceptions are thrown if the channel was sending data to the server at the time the server went down.
   
   Full code:
   
   ```
   private FaultAction DetermineFaultAction(Exception exception, CancellationToken cancellationToken)
       => exception switch
       {
           PersistenceException _ => FaultAction.Retry,
           IOException _ => FaultAction.Retry,
           TooManyRequestsException _ => FaultAction.Retry,
           ChannelNotReadyException _ => FaultAction.Retry,
           ServiceNotReadyException _ => FaultAction.Retry,
           ConnectionDisposedException _ => FaultAction.Retry,
           AsyncLockDisposedException _ => FaultAction.Retry,
           PulsarStreamDisposedException _ => FaultAction.Retry,
           AsyncQueueDisposedException _ => FaultAction.Retry,
           OperationCanceledException _ => cancellationToken.IsCancellationRequested ? FaultAction.Rethrow : FaultAction.Retry,
           DotPulsarException _ => FaultAction.Rethrow,
           SocketException socketException => socketException.SocketErrorCode switch
           {
               SocketError.HostNotFound => FaultAction.Rethrow,
               SocketError.HostUnreachable => FaultAction.Rethrow,
               SocketError.NetworkUnreachable => FaultAction.Rethrow,
               _ => FaultAction.Retry
           },
           _ => FaultAction.Rethrow
       };
   
   ```
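   
   For context, the "default handler" here is DotPulsar's pluggable exception handler, and overriding the defaults presumably means supplying a custom handler when building the client. Below is a minimal sketch, assuming the builder exposes an `ExceptionHandler` hook and a handler interface (called `IHandleException` here, an assumed name) whose `OnException(ExceptionContext)` shape matches the `DefaultExceptionHandler.OnException` frame in the stack trace later in this thread; neither is confirmed API:
   
   ```
   using System;
   using DotPulsar;
   using DotPulsar.Abstractions;
   
   // Sketch only: the ExceptionHandler(...) hook and IHandleException are
   // assumptions inferred from this thread, not verified DotPulsar API.
   var client = PulsarClient.Builder()
       .ServiceUrl(new Uri("pulsar://localhost:6650"))
       .ExceptionHandler(new CustomExceptionHandler()) // hypothetical IHandleException implementation
       .Build();
   ```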





[GitHub] [pulsar-dotpulsar] usaguerrilla edited a comment on issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

Posted by GitBox <gi...@apache.org>.
usaguerrilla edited a comment on issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56#issuecomment-690586257


   I will think about it more, but one problem here is that not all commands are initiated and sent by the client (e.g. the flow / subscribe commands). It also affects the consumer. If the server shuts down while the consumer is sending a flow command, it doesn't really affect the caller, as no data is lost. The operation should be retried once the server becomes available.





[GitHub] [pulsar-dotpulsar] blankensteiner commented on issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

Posted by GitBox <gi...@apache.org>.
blankensteiner commented on issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56#issuecomment-743920932


   Hi @usaguerrilla 
   For the time being, we'll stick with DotPulsar retrying on known issues we believe to be temporary and faulting the consumer/producer/reader on unknown issues and on known issues we believe to be permanent, while giving the user the option of overriding these defaults.
   If you still feel that the faulted state doesn't make sense and that we should always retry everything, then I welcome you to comment on my post in the Slack channel, where I invite a discussion about this very topic.








[GitHub] [pulsar-dotpulsar] usaguerrilla commented on issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

Posted by GitBox <gi...@apache.org>.
usaguerrilla commented on issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56#issuecomment-689735744


   True, DotPulsar is customizable. It is also true that the vanilla one hangs from time to time if the server disconnects, so it is a bug (unless we expect every DotPulsar customer to discover this by themselves).
   
   IMHO it should be addressed.








[GitHub] [pulsar-dotpulsar] usaguerrilla commented on issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

Posted by GitBox <gi...@apache.org>.
usaguerrilla commented on issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56#issuecomment-690640525


   In some sense, if we bail on reconnecting and mark the consumer / producer as invalid in some cases, why not bail all the time? IMHO consistency is way better than something like "we do reconnect, but only sometimes".





[GitHub] [pulsar-dotpulsar] usaguerrilla commented on issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

Posted by GitBox <gi...@apache.org>.
usaguerrilla commented on issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56#issuecomment-690628136


   So the consumer is trying to reconnect while the server is shutting down (or crashing), and it gets this:
   
   ```Unable to read data from the transport connection: An established connection was aborted by the software in your host machine```
   
   The exception type is System.IO.IOException.
   
   This isn't an action initiated by the client; it is internal code. This is just an example. I need to think about it more.
   
   ```
    	DotPulsar.dll!DotPulsar.Internal.Process.Handle(DotPulsar.Internal.Abstractions.IEvent e) Line 51	C#
    	DotPulsar.dll!DotPulsar.Internal.ProcessManager.Register(DotPulsar.Internal.Abstractions.IEvent e) Line 79	C#
   >	DotPulsar.dll!DotPulsar.Internal.Executor.Handle(System.Exception exception, System.Threading.CancellationToken cancellationToken) Line 176	C#
    	DotPulsar.dll!DotPulsar.Internal.Executor.Execute<DotPulsar.Internal.Abstractions.IConsumerChannel>(System.Func<System.Threading.Tasks.ValueTask<DotPulsar.Internal.Abstractions.IConsumerChannel>> func, System.Threading.CancellationToken cancellationToken) Line 155	C#
    	[Resuming Async Method]	
    	System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder<DotPulsar.Internal.Abstractions.IConsumerChannel>.AsyncStateMachineBox<DotPulsar.Internal.Executor.<Execute>d__9<DotPulsar.Internal.Abstractions.IConsumerChannel>>.MoveNext(System.Threading.Thread threadPoolThread)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.TaskAwaiter.OutputWaitEtwEvents.AnonymousMethod__12_0(System.Action innerContinuation, System.Threading.Tasks.Task innerTask)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action action, bool allowInlining)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(object continuationObject)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.Task<bool>.TrySetResult(bool result)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder<bool>.SetResult(bool result)	Unknown
    	[Completed] DotPulsar.dll!DotPulsar.Internal.Executor.Handle(System.Exception exception, System.Threading.CancellationToken cancellationToken) Line 181	C#
    	System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder<bool>.AsyncStateMachineBox<DotPulsar.Internal.Executor.<Handle>d__10>.MoveNext(System.Threading.Thread threadPoolThread)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.TaskAwaiter.OutputWaitEtwEvents.AnonymousMethod__12_0(System.Action innerContinuation, System.Threading.Tasks.Task innerTask)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action action, bool allowInlining)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(object continuationObject)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.Task<System.Threading.Tasks.VoidTaskResult>.TrySetResult(System.Threading.Tasks.VoidTaskResult result)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder.SetResult()	Unknown
    	[Completed] DotPulsar.dll!DotPulsar.Internal.ExceptionHandlerPipeline.OnException(DotPulsar.ExceptionContext exceptionContext) Line 38	C#
    	System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>.AsyncStateMachineBox<DotPulsar.Internal.ExceptionHandlerPipeline.<OnException>d__2>.MoveNext(System.Threading.Thread threadPoolThread)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.TaskAwaiter.OutputWaitEtwEvents.AnonymousMethod__12_0(System.Action innerContinuation, System.Threading.Tasks.Task innerTask)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action action, bool allowInlining)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(object continuationObject)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.Task<System.Threading.Tasks.VoidTaskResult>.TrySetResult(System.Threading.Tasks.VoidTaskResult result)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder.SetResult()	Unknown
    	[Completed] DotPulsar.dll!DotPulsar.Internal.DefaultExceptionHandler.OnException(DotPulsar.ExceptionContext exceptionContext) Line 41	C#
    	System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>.AsyncStateMachineBox<DotPulsar.Internal.DefaultExceptionHandler.<OnException>d__2>.MoveNext(System.Threading.Thread threadPoolThread)	Unknown
    	System.Private.CoreLib.dll!System.Runtime.CompilerServices.TaskAwaiter.OutputWaitEtwEvents.AnonymousMethod__12_0(System.Action innerContinuation, System.Threading.Tasks.Task innerTask)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action action, bool allowInlining)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(object continuationObject)	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.Task.TrySetResult()	Unknown
    	System.Private.CoreLib.dll!System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()	Unknown
    	System.Private.CoreLib.dll!System.Threading.TimerQueueTimer.CallCallback(bool isThreadPool)	Unknown
    	System.Private.CoreLib.dll!System.Threading.TimerQueueTimer.Fire(bool isThreadPool)	Unknown
    	System.Private.CoreLib.dll!System.Threading.TimerQueue.FireNextTimers()	Unknown
    	[Async Call Stack]	
    	[Async] DotPulsar.dll!DotPulsar.Internal.ConsumerChannelFactory.Create(System.Threading.CancellationToken cancellationToken) Line 67	C#
    	[Async] DotPulsar.dll!DotPulsar.Internal.ConsumerProcess.SetupChannel() Line 87	C#
   ```





[GitHub] [pulsar-dotpulsar] blankensteiner commented on issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

Posted by GitBox <gi...@apache.org>.
blankensteiner commented on issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56#issuecomment-689827958


   Hi @usaguerrilla 
   
   Well, that doesn't really answer my questions.
   
   There are three kinds of exceptions:
   The ones where we know that retrying is the right thing to do.
   The ones where we know that retrying doesn't solve the problem and we, therefore, fault the consumer/producer/reader, thereby letting the user know that something is wrong.
   The ones where we don't know what the right action is. In those cases, we fault the consumer/producer/reader instead of risking retrying endlessly.
   
   We want to provide the best default actions and, more importantly, we want to provide a way for you to define your own logic.
   
   The 'Faulted' state is a final state, meaning that no retry/reconnect will ever happen, so the user is expected to handle that. This is by design, so it's not hanging but simply faulted.
   
   So, the question is what you are proposing.
   Do you want to discontinue the 'Faulted' state and always retry/reconnect no matter what kind of exception we get?
   Or are you suggesting changing the defaults to retry more often?
   
   If the former, then writing a handler to monitor the exceptions and do logging/alerting would be required of all users, because otherwise they would never know when some error makes the consumer/producer/reader retry/reconnect eternally.
   
   If the latter, then I think IOException is a very broad exception to always retry on, and in the case of PersistenceException, I would worry about how to let the user know that something needs human attention, if not by faulting and throwing an exception.
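   
   As a rough illustration of that monitoring burden, an "always retry, but log everything" handler might look like the sketch below. It mirrors the OperationCanceledException pattern from the default handler quoted at the top of this thread; the `IHandleException` interface name and the `ExceptionContext` members used here (Exception, CancellationToken, Result, ExceptionHandled) are assumptions inferred from this thread's stack trace, not confirmed API:
   
   ```
   using System;
   using System.Threading.Tasks;
   using DotPulsar;
   using DotPulsar.Abstractions;
   
   // Sketch only: retry on everything except caller-initiated cancellation,
   // while surfacing every exception so the user can log and alert on it.
   public sealed class RetryAndLogHandler : IHandleException
   {
       public ValueTask OnException(ExceptionContext exceptionContext)
       {
           Console.Error.WriteLine($"DotPulsar exception: {exceptionContext.Exception}");
           exceptionContext.Result = exceptionContext.CancellationToken.IsCancellationRequested
               ? FaultAction.Rethrow // honor cancellation requested by the caller
               : FaultAction.Retry;  // otherwise keep retrying/reconnecting
           exceptionContext.ExceptionHandled = true;
           return new ValueTask();
       }
   }
   ```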





[GitHub] [pulsar-dotpulsar] blankensteiner commented on issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

Posted by GitBox <gi...@apache.org>.
blankensteiner commented on issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56#issuecomment-691012175


   Hi @usaguerrilla 
   
   Bailing all the time would mean that every user needs to handle retry/reconnect, which in turn means that it would make sense for DotPulsar to handle it, and then we are back to looking for a strategy for doing this.
   
   Never bailing (always retrying/reconnecting) is maybe an option, but then we should remove the 'Faulted' state.
   That leaves us with the question of whether retrying on exceptions like "Topic terminated", "Topic not found", "Authentication/Authorization exception", "Checksum exception", "Incompatible schema exception", "Invalid topic name exception", "Metadata exception", "Unsupported version exception" and "Subscription not found exception" is a good idea. Some of these we can safely say can't be fixed, so it's pointless to keep retrying; this forces us to keep the 'Faulted' state and use our judgment to pick some sane defaults.
   We could change the default to retry/reconnect on more exceptions and/or change the default action for unknown exceptions to 'Retry', but the user would still need to provide a handler to log the exceptions (and to create alerts if needed) and possibly also to change some default behavior. Then we are back at what you are objecting to.
   
   So, I would love to have a discussion about this, and I am open to ideas, but I need a concrete change description before I can evaluate its pros and cons compared to the current implementation.
   





[GitHub] [pulsar-dotpulsar] blankensteiner commented on issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

Posted by GitBox <gi...@apache.org>.
blankensteiner commented on issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56#issuecomment-689474555


   Hi @usaguerrilla 
   The aim is to make DotPulsar customizable so that the user can decide when to retry and when to fault the consumer/reader/producer.
   No matter what the defaults are, there will be use cases where a user wants to change them, which is why we have the handler.
   I'm not sure what you are suggesting here. To change the default to retry on more exceptions, or to change it to always retry on everything? (Given your dislike for basing retry logic on exceptions, alternatives are welcome, btw.)








[GitHub] [pulsar-dotpulsar] blankensteiner closed issue #56: If server dies Consumer / Producer Channel sometimes goes into faulted state and never tries to reconnect

Posted by GitBox <gi...@apache.org>.
blankensteiner closed issue #56:
URL: https://github.com/apache/pulsar-dotpulsar/issues/56


   

