You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Jake Maes (JIRA)" <ji...@apache.org> on 2015/10/12 20:20:05 UTC

[jira] [Updated] (SAMZA-794) Add the ability to manually commit specific offsets

     [ https://issues.apache.org/jira/browse/SAMZA-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Maes updated SAMZA-794:
----------------------------
    Description: 
Background:
At LinkedIn we have seen a number of cases where users need to make remote calls in their Samza jobs. If the remote calls are done synchronously and single-threaded, they add significant delay to the Samza event loop. To mitigate this, users could increase the container count to get more parallelism, but that allocates excessive resources in order to wait on IO. Instead, some users employ the following approach:
* Implement a threadpool to do the IO asynchronously
* In window() wait for all threads to complete
* Commit the offsets manually (auto commit is turned off)

Problem:
The above approach works but has the effect of blocking incoming requests while they wait for the others to flush. The throughput comes and goes in spurts. 

One way to fix this throughput issue is:
* Put the offsets from the IncomingMessageEnvelopes into a queue
* Associate the offset with the IO thread
* When the IO thread completes, add the offset to a set
* Periodically pop as many items off the queue as possible (by first checking if they exist in the set) and then commit the last offset we were able to pop from the queue.

Unfortunately, the TaskCoordinator doesn't expose a commit() for specific offsets and instead uses the lastProcessedOffsets from the OffsetManager. 

The goal of this ticket is to add the ability to commit user-specified offsets to support the scenario above. 

  was:
Background:
At LinkedIn we have seen a number of cases where users need to make remote calls in their Samza jobs. If the remote calls are done synchronously and single-threaded, they add significant delay to the Samza event loop. To mitigate this, users could increase the container count to get more parallelism, but that allocates excessive resources in order to wait on IO. Instead, some users employ the following approach:
* Implement a threadpool to do the IO asynchronously
* In window() wait for all threads to complete
* Commit the offsets manually (auto commit is turned off)

This works but has the effect of blocking incoming requests while they wait for the others to flush. The throughput comes and goes in spurts. 

One way to fix this throughput issue is:
* Put the offsets from the IncomingMessageEnvelopes into a queue
* Associate the offset with the IO thread
* When the IO thread completes, add the offset to a set
* Periodically pop as many items off the queue as possible (by first checking if they exist in the set) and then commit the last offset we were able to pop from the queue.

Unfortunately, the TaskCoordinator doesn't expose a commit() for specific offsets and instead uses the lastProcessedOffsets from the OffsetManager. 

The goal of this ticket is to add the ability to commit user-specified offsets to support the scenario above. 


> Add the ability to manually commit specific offsets 
> ----------------------------------------------------
>
>                 Key: SAMZA-794
>                 URL: https://issues.apache.org/jira/browse/SAMZA-794
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Jake Maes
>
> Background:
> At LinkedIn we have seen a number of cases where users need to make remote calls in their Samza jobs. If the remote calls are done synchronously and single-threaded, they add significant delay to the Samza event loop. To mitigate this, users could increase the container count to get more parallelism, but that allocates excessive resources in order to wait on IO. Instead, some users employ the following approach:
> * Implement a threadpool to do the IO asynchronously
> * In window() wait for all threads to complete
> * Commit the offsets manually (auto commit is turned off)
> Problem:
> The above approach works but has the effect of blocking incoming requests while they wait for the others to flush. The throughput comes and goes in spurts. 
> One way to fix this throughput issue is:
> * Put the offsets from the IncomingMessageEnvelopes into a queue
> * Associate the offset with the IO thread
> * When the IO thread completes, add the offset to a set
> * Periodically pop as many items off the queue as possible (by first checking if they exist in the set) and then commit the last offset we were able to pop from the queue.
> Unfortunately, the TaskCoordinator doesn't expose a commit() for specific offsets and instead uses the lastProcessedOffsets from the OffsetManager. 
> The goal of this ticket is to add the ability to commit user-specified offsets to support the scenario above. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)