You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Aditya (JIRA)" <ji...@apache.org> on 2018/03/11 15:52:00 UTC

[jira] [Created] (SAMZA-1613) Stream-table join: Intermediate streams do not inherit bootstrap stream semantic

Aditya created SAMZA-1613:
-----------------------------

             Summary: Stream-table join: Intermediate streams do not inherit bootstrap stream semantic
                 Key: SAMZA-1613
                 URL: https://issues.apache.org/jira/browse/SAMZA-1613
             Project: Samza
          Issue Type: Bug
            Reporter: Aditya


There are two issues with repartitioning a bootstrap stream: 
 * Samza does not propagate the bootstrap semantics to the intermediate stream. 
 * For a bootstrap stream, Samza gets the newest offset for that stream before consumption and considers the bootstrap to be complete once the message with that offset is consumed. This is clearly a problem for intermediate streams as there might not be any messages at the beginning of the job. 

Even though the first stage abides by the bootstrap semantics and does not consume from non-bootstrap streams until all the bootstrap stream partitions owned by that container are re-partitioned, there will be scenarios where one container finishes bootstrap faster than others and also the rate of consumption from the intermediate stream in the second stage might be slower. All these would result in the job breaking the bootstrap semantics across the two stages. Please note that this issue is not just specific to repartitioning but could happen with any intermediate streams. 

To support this, we will have to solve both the afore-mentioned issues: 
 * Propagate bootstrap semantics to the intermediate streams if the repartitioned stream is bootstrap stream. 
 * Either add end of bootstrap control message (note that all the containers have to co-ordinate here) or wait for event time support in Samza. 

This is clearly a problem for Stream-Table join. Consider the scenario where the load on the stream is much higher than the load on the stream marked as table. Consequently, the number of partitions in the stream will be higher than that of the table. To join these two, we will have to increase the number of partitions in the table by repartitioning, which is where we end up in the premature join of stream with the table, before seeding the table. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)