You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Joshua Hartman (JIRA)" <ji...@apache.org> on 2015/10/12 20:25:05 UTC

[jira] [Commented] (SAMZA-794) Add the ability to manually commit specific offsets

    [ https://issues.apache.org/jira/browse/SAMZA-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953515#comment-14953515 ] 

Joshua Hartman commented on SAMZA-794:
--------------------------------------

Hi dev team - Jake did a great job describing what we would like to do. Essentially we would like the ability to commit a specific offset as part of our samza job so that we can use a threaded model for async reads over the network. The actual response processing we still do in a single-thread for simplicity and to avoid clobbering other users in the YARN cluster. We have a pretty significant amount of in-memory state (e.g. replicated bloom filters) so using a large number of partitions is a wasteful approach.

> Add the ability to manually commit specific offsets 
> ----------------------------------------------------
>
>                 Key: SAMZA-794
>                 URL: https://issues.apache.org/jira/browse/SAMZA-794
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Jake Maes
>
> Background:
> At LinkedIn we have seen a number of cases where users need to make remote calls in their Samza jobs. If the remote calls are done synchronously and single-threaded, they add significant delay to the Samza event loop. To mitigate this, users could increase the container count to get more parallelism, but that allocates excessive resources in order to wait on IO. Instead, some users employ the following approach:
> * Implement a threadpool to do the IO asynchronously
> * In window() wait for all threads to complete
> * Commit the offsets manually (auto commit is turned off)
> Problem:
> The above approach works but has the effect of blocking incoming requests while they wait for the others to flush. The throughput comes and goes in spurts. 
> One way to fix this throughput issue is:
> * Put the offsets from the IncomingMessageEnvelopes into a queue
> * Associate the offset with the IO thread
> * When the IO thread completes, add the offset to a set
> * Periodically pop as many items off the queue as possible (by first checking if they exist in the set) and then commit the last offset we were able to pop from the queue.
> Unfortunately, the TaskCoordinator doesn't expose a commit() for specific offsets and instead uses the lastProcessedOffsets from the OffsetManager. 
> The goal of this ticket is to add the ability to commit user-specified offsets to support the scenario above. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)