Posted to dev@kafka.apache.org by "Joel Koshy (JIRA)" <ji...@apache.org> on 2013/06/13 02:13:22 UTC

[jira] [Commented] (KAFKA-937) ConsumerFetcherThread can deadlock

    [ https://issues.apache.org/jira/browse/KAFKA-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681787#comment-13681787 ] 

Joel Koshy commented on KAFKA-937:
----------------------------------

+1 on the patch.

Additionally, can you make this small (unrelated) change: make the console consumer's autoCommitIntervalOpt default to ConsumerConfig.AutoCommitInterval?

I think it is worth documenting the typical path of getting into the above deadlock:
- Assume at least two fetchers F1, F2
- One or more partitions on F1 go into error and leader finder thread L is notified
- L unblocks and proceeds to handle partitions without leader. It holds the ConsumerFetcherManager's lock at this point.
- All partitions on F2 go into error.
- F2's handlePartitionsWithError removes partitions from its fetcher's partitionMap. (At this point, F2 is by definition an idle fetcher thread.)
- L tries to shut down idle fetcher threads, i.e., it tries to shut down F2, and waits for F2 to finish.
- However, F2 at this point is trying to call addPartitionsWithError, which needs to acquire the ConsumerFetcherManager's lock (which is currently held by L). L is waiting on F2 and F2 is waiting on L, so neither can make progress.
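
For illustration, here is a minimal Java sketch of the cycle described in the steps above. It is not Kafka code: the class FetcherDeadlockSketch, the method reproduce, and the variable names are hypothetical, a plain ReentrantLock stands in for the ConsumerFetcherManager's lock, and a CountDownLatch stands in for whatever the fetcher's shutdown waits on. The sketch assumes the shutdown waits without a timeout in the real code; here the wait is bounded so that the program terminates and can report that F2 was stuck.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the L / F2 interaction; none of these names exist in Kafka.
final class FetcherDeadlockSketch {

    // Returns true if F2 was still blocked on the manager lock when L's bounded wait expired.
    static boolean reproduce(long waitMillis) {
        ReentrantLock managerLock = new ReentrantLock();         // stands in for the manager's lock
        CountDownLatch leaderHoldsLock = new CountDownLatch(1);  // forces the interleaving from the list
        CountDownLatch fetcherDone = new CountDownLatch(1);      // stands in for F2 finishing its error handling

        Thread f2 = new Thread(() -> {
            try {
                leaderHoldsLock.await();          // L already holds the lock at this point
                managerLock.lockInterruptibly();  // "addPartitionsWithError": blocks while L holds the lock
                try {
                    // would hand the errored partitions back to the manager here
                } finally {
                    managerLock.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                fetcherDone.countDown();          // only reached once the lock was obtained
            }
        }, "F2");
        f2.setDaemon(true);
        f2.start();

        boolean stuck;
        managerLock.lock();                       // L: handling partitions without a leader
        try {
            leaderHoldsLock.countDown();
            // "shutdownIdleFetcherThreads": wait for F2 while still holding the lock.
            // An unbounded wait here would never return; the timeout is only for the demo.
            stuck = !fetcherDone.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for F2", e);
        } finally {
            managerLock.unlock();                 // releasing the lock lets F2 proceed
        }

        try {
            f2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return stuck;
    }

    public static void main(String[] args) {
        System.out.println("F2 blocked while L waited on it: " + reproduce(200));
    }
}
```

The sketch only makes the wait-for cycle visible: L waits for F2 while holding the lock that F2 needs. Breaking either edge (not holding the lock while waiting for the fetcher to stop, or not taking the lock on the fetcher's error path) removes the cycle; which of these the attached patch does is not stated in this comment.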

The deadlock is relatively rare, since it can happen only if all partitions on a fetcher are in error at the same time. This could happen, for example, if all the leaders for those partitions move or become unavailable. Another instance where this may be seen in practice is mirroring: we ran into it when running the mirror maker with a very large number of producers and ran out of file handles. Running out of file handles can easily lead to exceptions on most or all fetches and result in an error state for all partitions.

                
> ConsumerFetcherThread can deadlock
> ----------------------------------
>
>                 Key: KAFKA-937
>                 URL: https://issues.apache.org/jira/browse/KAFKA-937
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8
>            Reporter: Jun Rao
>            Assignee: Jun Rao
>         Attachments: kafka-937.patch
>
>
> We have the following access pattern that can introduce a deadlock.
> AbstractFetcherThread.processPartitionsWithError() ->
> ConsumerFetcherThread.processPartitionsWithError() -> 
> ConsumerFetcherManager.addPartitionsWithError() wait for lock ->
> LeaderFinderThread holding lock while calling AbstractFetcherManager.shutdownIdleFetcherThreads() ->
> AbstractFetcherManager calling fetcher.shutdown, which needs to wait until AbstractFetcherThread.processPartitionsWithError() completes.
