You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@kafka.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2019/10/07 23:23:00 UTC

[jira] [Assigned] (KAFKA-8994) Streams should expose standby replication information & allow stale reads of state store

     [ https://issues.apache.org/jira/browse/KAFKA-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar reassigned KAFKA-8994:
-------------------------------------

    Assignee: Vinoth Chandar

> Streams should expose standby replication information & allow stale reads of state store
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8994
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8994
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> Currently Streams interactive queries (IQ) fail during the time period where there is a rebalance in progress. 
> Consider the following scenario in a three node Streams cluster with node A, node S and node R, executing a stateful sub-topology/topic group with 1 partition and `_num.standby.replicas=1_`  
>  * *t0*: A is the active instance owning the partition, B is the standby that keeps replicating the A's state into its local disk, R just routes streams IQs to active instance using StreamsMetadata
>  * *t1*: IQs pick node R as router, R forwards query to A, A responds back to R which reverse forwards back the results.
>  * *t2:* Active A instance is killed and rebalance begins. IQs start failing to A
>  * *t3*: Rebalance assignment happens and standby B is now promoted as active instance. IQs continue to fail
>  * *t4*: B fully catches up to changelog tail and rewinds offsets to A's last commit position, IQs continue to fail
>  * *t5*: IQs to R, get routed to B, which is now ready to serve results. IQs start succeeding again
>  
> Depending on Kafka consumer group session/heartbeat timeouts, step t2,t3 can take few seconds (~10 seconds based on defaults values). Depending on how laggy the standby B was prior to A being killed, t4 can take few seconds-minutes. 
>  
> While this behavior favors consistency over availability at all times, the long unavailability window might be undesirable for certain classes of applications (e.g simple caches or dashboards). 
>  
> This issue aims to also expose information about standby B to R, during each rebalance such that the queries can be routed by an application to a standby to serve stale reads, choosing availability over consistency. 
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)